Enhanced Platform Awareness (EPA) in OpenShift — Part I, HugePages

Luis Javier Arizmendi Alonso
10 min read · May 29, 2020


Overview

Enhanced Platform Awareness (EPA) is a compilation of multiple features that allow efficient utilization of the underlying platform capabilities when deploying deterministic applications. The EPA methodology makes it possible to build systems suitable for running platforms like Network Function Virtualization (NFV) and other latency-sensitive workloads.

The main focus of EPA is therefore to provide deterministic application performance, and it comprises, among others:

  • HugePages usage
  • CPU Pinning
  • NUMA topology awareness
  • SR-IOV
  • DPDK
  • Real Time Kernel

Over the last year, more and more latency-sensitive workloads have needed to run on OpenShift, so this kind of configuration will become common. You can find how to set up these features in the official OpenShift documentation, but sadly the pieces are spread across multiple sections and some parts, such as isolating CPUs, are missing. This series of posts will review how to configure end-to-end EPA support in OpenShift 4 (4.4.4 in this case).

In this series of posts we will see how to configure each feature separately. There is also a newer, operator-based method that configures all the EPA features at once and makes this setup much easier; we will probably review that method at the end of the series.

I say easier not because configuring the EPA features is difficult, but because we would be able to do it all at once from a single place. This time we are going to use the current methodology and configure these features one by one, although you can combine several steps since some configurations are made on the same resource objects.

In this first post of the EPA series, I will cover HugePages configuration. If you want to review other parts, go directly to:

Before We Begin, Do you need an EPA “role”?

In the examples shown in these posts, you will see that we configure EPA features in the “worker” role, which means that all nodes will be reconfigured.

If you want to deploy a mixed environment (EPA and non-EPA worker nodes), it is better to create a new role (including all the required labels), either by adding a new machineSet (if you can use the Machine API), following the same procedure used to create "infra nodes", or by creating the machineconfigpool and machineconfig objects manually if you cannot use machineSets.

An example using a machineSet (for OpenStack in this case) is shown below. Pay special attention to the labels, where we declare the new role (node-role.kubernetes.io/worker-epa: "") and the labels that will be used during the EPA feature configuration (hugepages-1g: "true" and cpumanager: "true").

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: ocp-b79gr
    machine.openshift.io/cluster-api-machine-role: worker-epa
    machine.openshift.io/cluster-api-machine-type: worker-epa
  name: ocp-b79gr-worker-epa
  namespace: openshift-machine-api
  resourceVersion: '208481'
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: ocp-b79gr
      machine.openshift.io/cluster-api-machineset: ocp-b79gr-worker-epa
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: ocp-b79gr
        machine.openshift.io/cluster-api-machine-role: worker-epa
        machine.openshift.io/cluster-api-machine-type: worker-epa
        machine.openshift.io/cluster-api-machineset: ocp-b79gr-worker-epa
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/worker-epa: ""
          hugepages-1g: "true"
          cpumanager: "true"
      providerSpec:
        value:
          cloudName: openstack
          networks:
          - filter: {}
            subnets:
            - filter:
                name: ocp-b79gr-nodes
                tags: openshiftClusterID=ocp-b79gr
          userDataSecret:
            name: worker-user-data
          cloudsSecret:
            name: openstack-cloud-credentials
            namespace: openshift-machine-api
          metadata:
            creationTimestamp: null
          serverMetadata:
            Name: ocp-b79gr-worker
            openshiftClusterID: ocp-b79gr
          securityGroups:
          - filter: {}
            name: ocp-b79gr-worker
          trunk: true
          kind: OpenstackProviderSpec
          tags:
          - openshiftClusterID=ocp-b79gr
          image: ocp-b79gr-rhcos
          apiVersion: openstackproviderconfig.openshift.io/v1alpha1
          flavor: ocp-epa

If you cannot use machineSets, you can still create a new role by creating a new machineconfigpool object, "copy-pasting" the machineconfig objects from the worker role, and relabeling the nodes.

You can see an example for the infra-node role in this Git repo; it's more or less the same, but using a role name like "worker-epa" instead of "infra" (do not forget to add the hugepages-1g: "true" and cpumanager: "true" labels to those nodes too). The three steps are shown below:

1-Create a new MachineConfigPool

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-epa
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker-epa
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-epa: ""
  paused: false

2-”Copy-Paste” the worker role MachineConfigs

Generate the new files by copying the worker role machineconfigs:

$ oc get mc 00-worker -o yaml > 00-worker-epa.yaml
$ oc get mc 01-worker-container-runtime -o yaml > 01-worker-epa-container-runtime.yaml
$ oc get mc 01-worker-kubelet -o yaml > 01-worker-epa-kubelet.yaml
$ oc get mc 99-worker-<HEX-VALUE-TO-REPLACE>-registries -o yaml > 99-worker-epa-<HEX-VALUE-TO-REPLACE>-registries.yaml
$ oc get mc 99-worker-ssh -o yaml > 99-worker-epa-ssh.yaml

Then remove what is not needed when creating new objects:

$ sed -i \
-e '/annotations/,+1d' \
-e '/creationTimestamp/d' \
-e '/generation/d' \
-e '/ownerReference/,+6d' \
-e '/resourceVersion/d' \
-e '/selfLink/d' \
-e '/uid/ {/data/!d}' \
-e 's/worker/worker-epa/' \
00-worker-epa.yaml
$ sed -i \
-e '/annotations/,+1d' \
-e '/creationTimestamp/d' \
-e '/generation/d' \
-e '/ownerReference/,+6d' \
-e '/resourceVersion/d' \
-e '/selfLink/d' \
-e '/uid/ {/data/!d}' \
-e 's/worker/worker-epa/' \
01-worker-epa-container-runtime.yaml
$ sed -i \
-e '/annotations/,+1d' \
-e '/creationTimestamp/d' \
-e '/generation/d' \
-e '/ownerReference/,+6d' \
-e '/resourceVersion/d' \
-e '/selfLink/d' \
-e '/uid/ {/data/!d}' \
-e 's/worker/worker-epa/' \
01-worker-epa-kubelet.yaml
$ sed -i \
-e '/annotations/,+1d' \
-e '/creationTimestamp/d' \
-e '/generation/d' \
-e '/ownerReference/,+4d' \
-e '/resourceVersion/d' \
-e '/selfLink/d' \
-e '/uid/ {/data/!d}' \
-e 's/worker/worker-epa/' \
99-worker-epa-<HEX-VALUE-TO-REPLACE>-registries.yaml
$ sed -i \
-e '/annotations/,+1d' \
-e '/creationTimestamp/d' \
-e '/generation/d' \
-e '/ownerReference/,+6d' \
-e '/resourceVersion/d' \
-e '/selfLink/d' \
-e '/uid/ {/data/!d}' \
-e 's/worker/worker-epa/' \
99-worker-epa-ssh.yaml

Finally, create new objects:

$ oc create -f 00-worker-epa.yaml
$ oc create -f 01-worker-epa-container-runtime.yaml
$ oc create -f 01-worker-epa-kubelet.yaml
$ oc create -f 99-worker-epa-<HEX-VALUE-TO-REPLACE>-registries.yaml
$ oc create -f 99-worker-epa-ssh.yaml

3-Label nodes to use the new role and remove the worker role

$ oc label node <node> node-role.kubernetes.io/worker-epa=
$ oc label node <node> node-role.kubernetes.io/worker-

The rest of the needed labels (hugepages-1g: "true" and cpumanager: "true") can be added during each feature configuration.
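Before moving on, you can check that the new pool has picked up the relabeled nodes with something like:

$ oc get mcp worker-epa
$ oc get nodes -l node-role.kubernetes.io/worker-epa=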

HugePages

When a process uses memory, the operating system allocates it in chunks called pages. The standard page size is 4 kB, but larger page sizes (HugePages) are also available. Using HugePages actually results in less efficient memory usage (deterministic environments are usually less efficient than non-deterministic ones), since a process will normally not use all the memory of a HugePage, but at the same time it results in fewer TLB misses (since the memory addresses the process needs are more likely covered by an already cached HugePage mapping).

Every time there is a TLB miss, the page tables must be walked to find the translation, adding latency and jitter to the overall operation and making the process less deterministic. Since we want deterministic operation as part of the EPA methodology, the usage of HugePages is recommended.
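As a rough illustration, covering 14 GiB of memory with standard 4 kB pages requires more than 3.6 million page-table entries, while with 1GB HugePages it requires only 14, so the TLB can cache translations for a much larger share of the process working set.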

HugePages support requires some configuration on the worker nodes, but OpenShift simplifies it thanks to the Node Tuning Operator.

HugePages configuration

  1. Label your nodes

First of all, let's mark the nodes where we want to configure HugePages support with a label (in this case 'hugepages-1g', because I will configure 1GB HugePages):

$ oc label node <node> hugepages-1g=true
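If you want to double-check which nodes got the label before continuing, you can list them with a label selector, for example:

$ oc get nodes -l hugepages-1g=true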

You can also use the Web Console instead of the 'oc' CLI:

2. Modify the Kernel Arguments of the nodes

We want to configure non-default 1GB HugePages (the default size is 2MB), so we need to configure the Kernel Arguments in addition to the Tuned configuration. In order to add new Kernel Arguments in RHCOS, we need to create a new machineConfig object including the changes for the nodes with the previously configured label (remember to replace the "worker" role with "worker-epa" if you created such a role):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 52-kargs-hugepages-1g
spec:
  kernelArguments:
  - default_hugepagesz=1G
  - hugepagesz=1G

If you created a file (mine was ‘52-kargs-hugepages-1g.yaml’) with that config, create the object using ‘oc’:

$ oc create -f 52-kargs-hugepages-1g.yaml

If you don’t want to create a file, you can create the object directly from the Shell:

oc create -f - <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 52-kargs-hugepages-1g
spec:
  kernelArguments:
  - default_hugepagesz=1G
  - hugepagesz=1G
EOF

You can also create objects using the Web Console:
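Keep in mind that a machineConfig with kernelArguments triggers a rolling reboot of the nodes in the affected pool, so wait until the pool finishes updating before moving to the next step. You can watch the progress with something like (use the "worker-epa" pool instead if you created that role):

$ oc get mcp worker
$ oc get nodes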

3. Configure the number of reserved HugePages

Once the Kernel Arguments are in place, create a Tuned config that includes the number of pages to be reserved (14 in the example below; remember to leave some of the node's memory free for other workloads).
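As a rough sizing example, vm.nr_hugepages=14 with 1GB pages pins 14GB of RAM on each matching node, so on a node with, say, 32GB of memory you would still leave around 18GB for the operating system, the kubelet, and regular workloads.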

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: hugepages-1g
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration for hugepages
      include=openshift-node
      [vm]
      transparent_hugepages=never
      [sysctl]
      vm.nr_hugepages=14
    name: node-hugepages-1g
  recommend:
  - match:
    - label: hugepages-1g
    priority: 30
    profile: node-hugepages-1g

Create the object:

$ oc create -f tuned-hugepages-1g.yaml

Or using the Web Console (in this case you can use the "+" sign to create your object, or click on "Explore" and look for the "Tuned" object):
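Once the Tuned object is created, the Node Tuning Operator should render and apply the profile on the matching nodes. One way to verify it (names and output depend on your cluster) is to list the per-node Profile objects in the operator namespace, something like:

$ oc get profile -n openshift-cluster-node-tuning-operator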

4. Permit “IPC_LOCK” SecurityContext Capability

We need to "relax" the security of our cluster by allowing the IPC_LOCK capability in the securityContext, which is required by the application to lock the HugePages memory inside the container and prevent it from being swapped out.

Otherwise, if you are running as a non-privileged user you can get this error when creating the POD:

pods "hugepages-test" is forbidden: unable to validate against any security context constraint: [capabilities.add: Invalid value: "IPC_LOCK": capability may not be added]

If you want regular users to be able to deploy workloads with HugePages, you have to allow it either by creating a SecurityContextConstraint (SCC) and assigning it to the 'admin' role (yes, I know it's confusing that the "regular" user role is the "admin" role), or by including that capability in the default SCC ("restricted"). I will do the latter.

$ oc edit scc restricted

You have to remove the "null" value in "allowedCapabilities" and add 'IPC_LOCK', something like this:

...
allowPrivilegedContainer: false
allowedCapabilities:
- IPC_LOCK
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
...
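After saving the SCC, you can confirm that the capability is now allowed with something like:

$ oc get scc restricted -o jsonpath='{.allowedCapabilities}'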

HugePages testing

And that's all. You can check that everything went well by reviewing the Kernel Arguments and the allocated HugePages after jumping into the nodes:

$ ssh core@<node>
[core@<node> ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-28682f47d1b54999d14e778a4bc812b17acbf3a189483eb988336350ca405eb3/vmlinuz-4.18.0-147.8.1.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu rd.luks.options=discard ostree=/ostree/boot.1/rhcos/28682f47d1b54999d14e778a4bc812b17acbf3a189483eb988336350ca405eb3/0 default_hugepagesz=1G hugepagesz=1G
[core@<node> ~]$ cat /proc/meminfo | grep -i huge
AnonHugePages:    163840 kB
ShmemHugePages:        0 kB
HugePages_Total:      14
HugePages_Free:       14
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        14680064 kB

Or using the ‘oc’ client:

$ oc describe node <node> | grep -i huge
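If everything is in place, the node capacity and allocatable resources should report the reserved pages, with lines similar to these (the exact values depend on your configuration):

 hugepages-1Gi:  14Gi
 hugepages-2Mi:  0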

You can do it also from the Web Console:

Once you are sure that everything is set up in your nodes, you can try to deploy a POD using HugePages. HugePages can be consumed via resource limits with the resource name 'hugepages-<size>' (the node must support that HugePages size) and a volume with 'medium: HugePages' (if a single size was configured in the node) or 'medium: HugePages-<size>'. Don't forget to include the IPC_LOCK capability.

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-test
spec:
  containers:
  - image: rhel7:latest
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command:
    - sleep
    - inf
    name: example
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        hugepages-1Gi: 1Gi
        memory: "1Gi"
        cpu: "1"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

In the example above we configured 1GB pages, so we use that size and request 1Gi in the limits (which means we will be using one page). Only one HugePage size was configured in the node, but if more than one had been configured, PODs could consume them too. You can find some examples in the Kubernetes official documentation.

Create a project and the POD:

$ oc new-project test
$ oc create -f hugepages-test.yaml

Or if you prefer the Web Console:
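Once the POD is running, you can do a quick sanity check from inside the container and confirm that the HugePages-backed volume is mounted (assuming the POD name used in the example above):

$ oc exec hugepages-test -- mount | grep hugepages

If the POD stays in Pending state instead, check that the node really exposes enough 'hugepages-1Gi' allocatable capacity.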

Continue with CPU pinning in OpenShift:

… or directly with any other part of this series:
