Overview
This section presumes that you have already read the section on Understanding Threading.
To achieve the lowest possible latency and best throughput with minimal jitter, the platform supports the ability to pin critical threads to individual CPU cores. The sections below will present two approaches for affinitizing your application, but let's first review some nomenclature and concepts before diving into the details.
CPU Socket - refers to a physical connector on a motherboard that accepts a single physical chip. It is commonplace for modern CPUs to provide multiple physical cores which are exposed to the operating system as logical CPUs that can perform parallel execution streams. This document refers to a socket and physical CPU synonymously. See Also: CPU Socket
NUMA - Non-Uniform Memory Access refers to the commonplace architecture in which machines with multiple CPU sockets divide the banks of RAM into memory nodes on a per-socket basis. Access to memory on a socket's "local" memory node is faster than accessing memory on a remote node tied to a different socket. See Also: NUMA
CPU Core - Contemporary CPUs typically contain multiple cores, each of which is exposed to the underlying OS as one or more logical CPUs. See Also: Multi-core processing
Hyper-threading - Intel technology that makes a single physical core appear to the OS as multiple logical CPUs on the same chip to improve performance.
Logical CPU - What the operating system sees as a CPU. The number of logical CPUs available to the OS is <num sockets> * <cores per socket> * <hyper threads per core>. For example, a machine with 2 sockets, 6 cores per socket, and 2 hyper-threads per core exposes 24 logical CPUs.
Processor Affinity - Refers to the act of restricting the set of logical CPUs on which a particular program thread can execute.
Benefits of thread affinitization
Pinning a thread to a particular CPU ensures that the OS won't reschedule the thread to another core and incur a context switch that would force the thread to reload its working state from main memory, which results in jitter. When all critical threads in the processing pipeline are pinned to their own CPU and busy spinning, the OS scheduler is less likely to schedule another thread onto those cores, keeping the threads' processor caches hot.
NUMA Considerations
As an in-memory computing platform, the primary storage mechanism for an X application is memory. Consequently, from a Von Neumann Bottleneck perspective, the platform operates best when the processor is as close to its in-memory state as possible. On NUMA machines, this means keeping the application's memory on the NUMA node closest to the processors on which its threads run. The goals for NUMA affinitization are to:
- Allocate all of a process's memory on a single NUMA node.
- Pin all threads to cores that are connected to that NUMA node.
With the above:
- All threads execute on a core closest to the memory allocated to the process, decreasing memory access time for a thread to access state from main memory.
- When two threads collaborate within the process, say passing a message from one thread to another, pinning ensures the minimum memory distance between the sending and receiving threads, as those threads will share the same L3 cache.
Hyper-Threading Considerations
When a machine has enough CPU capacity, it is generally best to disable hyper-threading. When hyper-threading is enabled, each logical CPU shares the physical core's CPU caches, which means there is less cache space available per logical CPU and, consequently, higher memory access times. Consult your machine's BIOS documentation to determine how to configure hyper-threading if it is available for your machine.
Goals of Affinitization
With the above overview in mind, the best performance for an application will be achieved when:
- All application threads can be affinitized to the same processor socket (sharing the same L3 cache),
- Memory for the process is affinitized to the above socket (to avoid remote NUMA node access),
- Critical threads are pinned to their own CPU core and are set to busy spin rather than sleep or block (to avoid context switches),
- Optimally, Hyper-Threading is disabled (to avoid threads being scheduled onto the same physical core as a busy thread).
Preparing For Affinitization
To get the most out of affinitization, each busy spinning thread should be pinned to its own CPU core, which prevents the operating system from relocating the thread to another logical CPU while the program is executing. To affinitize your application, you first need to determine which threads in the application are busy spinning and what your machine's CPU layout is so that you can affinitize them effectively.
Identify Busy Spinning Threads
Any platform threads that are called out as critical in Understanding Threading should be affinitized. You will also want to determine whether your machine has enough CPUs on a single NUMA node for all critical threads to be colocated on one node. An easy way to see which threads are busy spinning is to enable XVM thread stats and trace.
Assuming you have enough CPUs on your machine that no two critical threads are scheduled on the same CPU, any thread that consistently uses over 90% CPU while your application is not processing messages is one that will benefit from affinitization. Determining the number of busy spinning threads will allow you to determine whether it is possible to pin them all to processors on the same NUMA node.
Determining CPU Layout
CPU layout is machine dependent. Before configuring CPU affinity masks, it is necessary to determine the CPU layout on the target machine. Talon includes a utility class, UtlThread, that can be run to assist with this:
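The exact invocation depends on your platform version. As a sketch, assuming UtlThread lives in the com.neeve.util package and exposes a main method that enumerates the machine's logical CPUs, it could be run from the command line as follows (the classpath jar name is a placeholder):

```
java -cp nvx-talon.jar com.neeve.util.UtlThread
```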
Running it will produce output similar to the following:
```
0: CpuInfo{socketId=0, coreId=0, threadId=0}
1: CpuInfo{socketId=1, coreId=0, threadId=0}
2: CpuInfo{socketId=0, coreId=8, threadId=0}
3: CpuInfo{socketId=1, coreId=8, threadId=0}
4: CpuInfo{socketId=0, coreId=2, threadId=0}
5: CpuInfo{socketId=1, coreId=2, threadId=0}
6: CpuInfo{socketId=0, coreId=10, threadId=0}
7: CpuInfo{socketId=1, coreId=10, threadId=0}
8: CpuInfo{socketId=0, coreId=1, threadId=0}
9: CpuInfo{socketId=1, coreId=1, threadId=0}
10: CpuInfo{socketId=0, coreId=9, threadId=0}
11: CpuInfo{socketId=1, coreId=9, threadId=0}
12: CpuInfo{socketId=0, coreId=0, threadId=1}
13: CpuInfo{socketId=1, coreId=0, threadId=1}
14: CpuInfo{socketId=0, coreId=8, threadId=1}
15: CpuInfo{socketId=1, coreId=8, threadId=1}
16: CpuInfo{socketId=0, coreId=2, threadId=1}
17: CpuInfo{socketId=1, coreId=2, threadId=1}
18: CpuInfo{socketId=0, coreId=10, threadId=1}
19: CpuInfo{socketId=1, coreId=10, threadId=1}
20: CpuInfo{socketId=0, coreId=1, threadId=1}
21: CpuInfo{socketId=1, coreId=1, threadId=1}
22: CpuInfo{socketId=0, coreId=9, threadId=1}
23: CpuInfo{socketId=1, coreId=9, threadId=1}
```
In the above, we can see:
- The machine has 24 logical CPUs (0 through 23).
- There are two processor sockets (socketId=0, socketId=1).
- There are 12 physical cores total ... 6 physical cores per socket (coreIds 0, 1, 2, 8, 9 and 10).
- Hyper-threading is enabled and there are two hardware threads per core (threadId=0, threadId=1).
Note that the fashion in which the OS assigns core numbers is OS dependent.
UtlThread available on Linux only
The UtlThread class is only supported on Linux currently. Eventually, support for other platforms will be added.
Best Practices
- Before launching your process, validate that there aren't other processes running that are spinning on a core to which you are affinitizing.
- Check what other processes on the host will use busy spinning and find out the cores they will use.
- In Linux, the OS often uses core 0 for some of its tasks, so it is better to avoid this core if possible.
- When feasible, it is best to disable hyper-threading to maximize the amount of CPU cache available to each CPU.
Basic Affinitization (Approach 1)
The basic affinitization approach requires no additional DDL configuration and simply uses the numactl command to restrict the NUMA memory nodes and logical CPUs on which your application can execute. Using this approach can be a good first step in evaluating the performance benefits of affinitizing your application, but it is not ideal for reducing jitter.
Pros
- Simple
- Avoids remote NUMA node access
Cons
- Does not prevent thread context switches; the OS is free to move threads between logical CPUs, which leads to jitter.
- Does not provide visibility into which CPU a thread is running on, making it harder to diagnose cases where two critical threads are scheduled on the same core.
Launching With Basic Affinitization
In the CPU layout determined above, one could launch an application with memory pinned to NUMA node 1, and only CPUs from socket 1 as follows:
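For example, with the layout shown above, memory and CPU execution can both be bound to node/socket 1 with numactl; the application launch command itself is a placeholder:

```
numactl --membind=1 --cpunodebind=1 java -jar my-application.jar
```

Here --membind restricts memory allocation to NUMA node 1 and --cpunodebind restricts execution to the CPUs of node 1.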
Refer to your numactl manual pages for more information.
Validating Basic Affinitization
With basic affinitization, it isn't straightforward to determine which CPUs a particular thread ends up running on, but you can use a command like top (pressing the '1' key after launching it to show per-CPU usage) to validate that all of your application threads are running on the expected CPUs. With enough effort, it may be possible to correlate the thread IDs displayed in a stack dump with those shown in a tool such as htop, but that is outside the scope of this document.
Advanced Affinitization (Approach 2)
For applications that are most concerned with reducing jitter, the basic affinitization approach described above still leaves open the potential for the operating system to relocate your threads from one CPU to another, which can lead to latency spikes. With the advanced affinitization approach described here, you avoid this by pinning each busy spinning or critical thread in the application to its own CPU to avoid context switching.
CPU Affinity Mask Format
Thread affinities are configured by supplying a mask that indicates the CPUs on which a thread can run. The mask can either be a long bit mask of logical CPUs, or a square-bracket-enclosed, comma-separated list enumerating the logical CPUs to which a thread should be affinitized. The latter format is recommended as it is easier to read.
Examples:
- "0" no affinity specified (0x0000)
- "[]" no affinity specified
- "1" specifies logical CPU 0 (0x0001)
- "[0]" specifies logical CPU 0
- "4" specifies logical CPU 2 (0x0100)
- "[2]" list specifying logical CPU 2
- "6" mask specifying logical CPU 1 and 2 (0x0110)
- "4294967296" specifies logical CPU 32 (0x1000 0000 0000 0000 0000 0000 0000 0000)
- "[32]" specifies logical CPU 32
- "[1,2]" list specifying logical CPU 1 and 2
Enabling Affinitization
By default, CPU affinitization is disabled. To enable it, set the enabling env flag in the DDL configuration.
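A minimal sketch of such a DDL env section is shown below. The property name nv.enablecpuaffinitymasks is an assumption; consult your platform's configuration reference for the exact flag name.

```xml
<env>
    <!-- Assumed property name: enables application of CPU affinity masks -->
    <nv.enablecpuaffinitymasks>true</nv.enablecpuaffinitymasks>
</env>
```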
Configuring CPU Affinities
Step 1: Configure Default Cpu Affinity Mask
Threads that are critical for reducing application latency and improving throughput are listed below, but not all threads are critical. To prevent non-critical threads from being scheduled on a CPU being used by a critical thread, the platform allows the application to configure one or more 'default' CPUs that non-critical threads can be affinitized to, by setting the 'nv.defaultcpuaffinitymask' environment variable. For example, the platform's statistics collection thread doesn't need its own dedicated CPU to perform its relatively simple tasks of periodically reporting heartbeats. However, we still want to ensure that the operating system doesn't try to schedule it onto the same core as a critical thread, so the platform will affinitize it with the default CPU affinity mask.
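For example, with the CPU layout shown earlier, non-critical threads could be restricted to two logical CPUs on socket 1. The property name comes from this document; the surrounding env element structure is a sketch.

```xml
<env>
    <!-- Non-critical threads may run on logical CPUs 9 and 11 (both on socket 1) -->
    <nv.defaultcpuaffinitymask>[9,11]</nv.defaultcpuaffinitymask>
</env>
```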
Step 2: Configure Critical Platform Threads Affinities
Critical platform-related threads are those that have the most impact on latency and performance. When the platform is optimized for latency or throughput, these threads will be set to use BusySpinning or Yielding respectively to avoid being context switched. Each of these threads should be assigned its own CPU.
See Appendix: Configuring Critical Thread Affinities, for a listing of these threads and how to configure their affinities.
Step 3: Affinitizing Non-Platform Threads
If your application uses its own threads, they can be affinitized as well by using the platform's UtlThread utility class. Non-critical threads that are not busy spinning should be affinitized to the default cores, and critical or busy threads should be pinned to their own core to prevent them from being scheduled on top of the platform's threads.
Non-Critical, Non-Spinning Thread
Non-critical threads can be affinitized to the set of default CPUs configured by nv.defaultcpuaffinitymask by calling setDefaultCpuAffinityMask from the thread to be affinitized.
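A minimal sketch is shown below; the UtlThread package name and the exact signature of setDefaultCpuAffinityMask are assumptions, so check the platform API docs before relying on them.

```java
import com.neeve.util.UtlThread; // package name assumed

Thread housekeeping = new Thread(() -> {
    // Affinitize this non-critical thread to the default CPU set configured
    // via nv.defaultcpuaffinitymask (method name from this document; the
    // no-argument form is an assumption).
    UtlThread.setDefaultCpuAffinityMask();
    // ... periodic, non latency-critical work ...
}, "housekeeping");
housekeeping.start();
```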
Critical or Busy Threads
Threads that participate in your transaction's processing flow, or that spin or are heavy CPU users, should be pinned to their own core so that they don't interfere with affinitized platform threads. For example:
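The following is a sketch of pinning an application-owned busy spinning thread to its own core using the list mask format described earlier; the UtlThread method name shown is an assumption.

```java
import com.neeve.util.UtlThread; // package name assumed

Thread feedHandler = new Thread(() -> {
    // Pin this busy spinning thread to logical CPU 5 so the OS won't migrate
    // it and it won't land on a core used by an affinitized platform thread.
    // Method name is an assumption; "[5]" is the list mask format from above.
    UtlThread.setCPUAffinityMask("[5]");
    while (!Thread.currentThread().isInterrupted()) {
        // ... busy spin polling for work ...
    }
}, "feed-handler");
feedHandler.start();
```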
Launching with NUMA Affinitization
Unlike with the Basic Affinitization approach, once all threads have been affinitized to their own core or to the default cores, it is not strictly necessary to restrict the cores on which the process can run; only the memory node needs to be restricted. In fact, it can even be beneficial to let threads outside the platform's or application's control be scheduled on other NUMA nodes.
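For example, to restrict only the memory node while leaving thread placement to the affinity masks configured above (the launch command is a placeholder):

```
numactl --membind=1 java -jar my-application.jar
```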
Validating Affinitization
Via Thread Stats Output
The easiest way to check your work is to enable XVM thread stats. Thread stats are emitted in heartbeats, and affinities can be reported in tools like Lumino. If the XVM is configured to trace thread stats, then thread usage is printed in the trace output.
You can look for any spinning thread (CPU% at 100) that doesn't have an affinity assigned. This will help you avoid the following pitfalls:
- Having two threads spinning on the same coreId will make performance worse (either on the same coreId but a different threadId, or worse, on the same coreId/threadId).
- Having some other non-X process spinning on one of the coreIds that you've affinitized to.
- Affinitizing across multiple socketIds (which are on different NUMA nodes) can make performance worse.
- Your maximum heap will be limited to the amount of physical memory in the memory bank of the NUMA node to which you are pinning.
The platform outputs thread affinitization using a format like aff=[6(s0c9t0)], which can be interpreted as logical CPU 6, which is on socket 0, core 9, thread 0.
Programmatically
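You can trigger the dump with a call along the following lines (the method name is illustrative, not confirmed API):

```java
// Illustrative only: the exact UtlThread method for dumping the
// affinitization state of registered threads is assumed here.
UtlThread.dumpAffinities();
```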
This will dump the affinitization state of all threads affinitized through UtlThread.
Via Trace
The above trace will also be printed by an AepEngine after messaging has been started or alternatively when it assumes a backup role (in most cases all platform threads will have been started by this time).
Limitations
The following limitations apply to thread affinitization support:
- Thread affinitization is currently only supported on Linux.
- Affinitization is limited to logical CPUs 0 through 63.
- Affinitizing a thread does not reserve the CPU core; it only limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the OS thread scheduler may schedule another thread on top of a critical thread when CPU resources are scarce.
Appendix: Configuring Critical Thread Affinities
Threads that can/should be affinitized include the following:
Thread | Description |
---|---|
Engine Input Multiplexer | The engine thread that dequeues and dispatches application messages and events. This is the main application thread. The detached threads described below can offload work from this thread, which can improve throughput and latencies in your application. |
Bus Detached Sender Thread | Each bus configured for your application can optionally be configured to send committed outbound messages on a detached thread. When the bus is configured for detached send, this thread offloads the work of serializing and writing outbound messages from the engine's input multiplexer, which serves as a buffer against spikes caused by message bus flow control. |
Store Reader Thread | The I/O thread for the store. On a primary instance, this is the thread that dispatches store acknowledgements back into the engine. On a backup, this is the thread that dispatches received replication traffic. |
Store Detached Send Thread | When the store is configured for detached send, this thread offloads the work of writing recovery data to the network for backup instances from the engine's input multiplexer, which can serve as a buffer against network I/O spikes. |
Store Detached Dispatch Thread | When the store is configured for detached dispatch, this thread allows the store reader thread to offload the work of dispatching deserialized replication traffic to the engine for processing. This is useful in cases where the cost of deserializing replication traffic is high. |
Store Detached Persister | When the store is configured for detached persistence, this thread offloads the work of writing recovery data to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Store Detached ICR | When the store is configured for detached Inter Cluster Replication, this thread offloads the work of writing recovery data to the ICR bus to insulate the engine's input multiplexer from spikes caused by flow control on the ICR bus. |
Detached Inbound Message Logger | When the application is configured with a detached inbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Detached Outbound Message Logger | When the application is configured with a detached outbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Bus Specific Threads
In addition to the core threads above, some bus bindings also support additional threads which may be affinitized.
Solace
Thread | Property Name | Description |
---|---|---|
Detached Dispatch Thread | dispatcher_cpu_affinity_mask | Configures the CPU affinity mask for the detached dispatch thread which, when enabled, offloads the work of deserializing messages and dispatching them from the Solace thread reading messages from the wire. |
Solace Consumer Session Dispatch Thread | consumer_cpu_affinity_mask | This property attempts to affinitize the Solace client library's receiver thread for this connection's consumer session. |
Solace Producer Session Dispatch Thread | producer_cpu_affinity_mask | This property attempts to affinitize the Solace client library's thread for this connection's producer session. |
Detached Send Thread | detached_sends_cpu_affinity_mask | Configures the CPU affinity mask for the detached sender thread. |
Below is an example of configuring affinities for a Solace binding.
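As a sketch, the properties above could be supplied on the Solace bus descriptor as query parameters; the host, port, and descriptor form shown are illustrative, while the property names come from the table above.

```
solace://mysolacehost:55555?dispatcher_cpu_affinity_mask=[3]&consumer_cpu_affinity_mask=[5]&producer_cpu_affinity_mask=[7]&detached_sends_cpu_affinity_mask=[9]
```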
Direct Binding
The direct binding allows applications to connect directly to an application hosted in the XVM. When using the direct binding, it is possible to set up a dedicated IOThread for accepting and servicing direct connections. The IOThread can be affinitized using the affinity attribute:
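A sketch of such configuration is shown below; apart from the affinity attribute named in this document, the element names and descriptor are illustrative.

```xml
<!-- Illustrative sketch: only the 'affinity' attribute is taken from this
     document; element names and the acceptor descriptor are assumptions. -->
<acceptor descriptor="direct://0.0.0.0:12000">
    <ioThread affinity="[7]"/>
</acceptor>
```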