Overview
This section presumes that you have already read the section on Understanding Threading.
To achieve the lowest possible latency and best throughput with minimal jitter, the platform supports the ability to pin critical threads to individual CPU cores. The sections below will present two approaches for affinitizing your application, but let's first review some nomenclature and concepts before diving into the details.
CPU Socket - refers to a physical connector on a motherboard that accepts a single physical chip. It is commonplace for modern CPUs to provide multiple physical cores which are exposed to the operating system as logical CPUs that can perform parallel execution streams. This document refers to a socket and physical CPU synonymously. See Also: CPU Socket
NUMA - Non-Uniform Memory Access refers to the commonplace architecture in which machines with multiple CPU sockets divide the banks of RAM into memory nodes on a per-socket basis. Access to memory on a socket's "local" memory node is faster than accessing memory on a remote node tied to a different socket. See Also: NUMA
CPU Core - Contemporary CPUs typically contain multiple cores, each of which is exposed to the underlying OS as one or more logical CPUs. See Also: Multi-core processing
Hyper-threading - Intel technology that makes a single physical core appear to the OS as multiple logical CPUs on the same chip to improve performance.
Logical CPU - What the operating system sees as a CPU. The number of logical CPUs available to the OS is <num sockets> * <cores per socket> * <hyper threads per core>. For example, a machine with 2 sockets, 6 cores per socket, and 2 hyper-threads per core exposes 24 logical CPUs.
Processor Affinity - Refers to the act of restricting the set of logical CPUs on which a particular program thread can execute.
Benefits of thread affinitization
Pinning a thread to a particular CPU ensures that the OS won't reschedule the thread to another core and incur a context switch that would force the thread to reload its working state from main memory, which results in jitter. When all critical threads in the processing pipeline are pinned to their own CPU and busy spinning, the OS scheduler is less likely to schedule another thread onto those cores, keeping the threads' processor caches hot.
NUMA Considerations
As an in-memory computing platform, the primary storage mechanism for an X application is memory. Consequently, from a Von Neumann Bottleneck perspective, the platform operates best when the processor is as close to its in-memory state as possible. On NUMA machines, this means keeping the application's memory on the NUMA node closest to the processors on which its threads run. The goals for NUMA affinitization are to:
- Allocate all of a process's memory on a single NUMA node.
- Pin all threads to cores that are connected to that NUMA node.
With the above:
- All threads execute on a core closest to the memory allocated to the process, decreasing memory access time for a thread to access state from main memory.
- When two threads collaborate within the process, say passing a message from one thread to another, pinning ensures the minimum memory distance between the sending and receiving threads, as those threads will share the same L3 cache.
Hyper-Threading Considerations
When a machine has enough CPU capacity, it is generally best to disable hyper-threading. When hyper-threading is enabled, each logical CPU shares the physical core's CPU caches, which means there is less cache space available per logical CPU and, consequently, higher memory access times. Consult your machine's BIOS documentation to determine how to configure hyper-threading if it is available for your machine.
Goals of Affinitization
With the above overview in mind, the best performance for an application will be achieved when:
- All application threads can be affinitized to the same processor socket (sharing the same L3 cache),
- Memory for the process is affinitized to the above socket (to avoid remote NUMA node access),
- Critical threads are pinned to their own CPU core and are set to busy spin rather than sleep or block (to avoid context switches),
- Optimally, Hyper-Threading is disabled (to avoid threads being scheduled onto the same physical core as a busy thread).
Preparing For Affinitization
To get the most out of affinitization, each busy spinning thread should be pinned to its own CPU core, which prevents the operating system from relocating the thread to another logical CPU while the program is executing. To affinitize your application, you first need to determine which threads in the application are busy spinning and what your machine's CPU layout is so that you can affinitize them effectively.
Identify Busy Spinning Threads
Any platform threads that are called out as critical in Understanding Threading should be affinitized. You will also want to determine whether your machine has enough CPUs on a single NUMA node for all critical threads to be colocated on one node. An easy way to see which threads are busy spinning is to enable XVM thread stats and trace.
Assuming you have enough CPUs on your machine that no two critical threads are scheduled on the same CPU, any thread that consistently uses over 90% CPU while your application is not processing messages is one that will benefit from affinitization. Determining the number of busy spinning threads will allow you to determine whether it is possible to pin them all to processors on the same NUMA node.
Determining CPU Layout
CPU layout is machine dependent. Before configuring CPU affinity masks, it is necessary to determine the CPU layout on the target machine. Talon includes a utility class, UtlThread, that can be run to assist with this:
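The exact invocation depends on your platform version. As a sketch, assuming UtlThread lives in the com.neeve.util package and exposes a main method that enumerates the machine's logical CPUs, it could be run from the command line as follows (the classpath jar name is a placeholder):

```
java -cp nvx-talon.jar com.neeve.util.UtlThread
```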
Running it will produce output similar to the following:
```
0: CpuInfo{socketId=0, coreId=0, threadId=0}
1: CpuInfo{socketId=1, coreId=0, threadId=0}
2: CpuInfo{socketId=0, coreId=8, threadId=0}
3: CpuInfo{socketId=1, coreId=8, threadId=0}
4: CpuInfo{socketId=0, coreId=2, threadId=0}
5: CpuInfo{socketId=1, coreId=2, threadId=0}
6: CpuInfo{socketId=0, coreId=10, threadId=0}
7: CpuInfo{socketId=1, coreId=10, threadId=0}
8: CpuInfo{socketId=0, coreId=1, threadId=0}
9: CpuInfo{socketId=1, coreId=1, threadId=0}
10: CpuInfo{socketId=0, coreId=9, threadId=0}
11: CpuInfo{socketId=1, coreId=9, threadId=0}
12: CpuInfo{socketId=0, coreId=0, threadId=1}
13: CpuInfo{socketId=1, coreId=0, threadId=1}
14: CpuInfo{socketId=0, coreId=8, threadId=1}
15: CpuInfo{socketId=1, coreId=8, threadId=1}
16: CpuInfo{socketId=0, coreId=2, threadId=1}
17: CpuInfo{socketId=1, coreId=2, threadId=1}
18: CpuInfo{socketId=0, coreId=10, threadId=1}
19: CpuInfo{socketId=1, coreId=10, threadId=1}
20: CpuInfo{socketId=0, coreId=1, threadId=1}
21: CpuInfo{socketId=1, coreId=1, threadId=1}
22: CpuInfo{socketId=0, coreId=9, threadId=1}
23: CpuInfo{socketId=1, coreId=9, threadId=1}
```
In the above, we can see:
- The machine has 24 logical CPUs (0 through 23).
- There are two processor sockets (socketId=0, socketId=1).
- There are 12 physical cores total ... 6 physical cores per socket (coreIds 0, 1, 2, 8, 9 and 10).
- Hyper-threading is enabled and there are two hardware threads per core (threadId=0, threadId=1).
Note that the fashion in which the OS assigns core numbers is OS dependent.
UtlThread available on Linux only
The UtlThread class is only supported on Linux currently. Eventually, support for other platforms will be added.
Best Practices
- Before launching your process, validate that there aren't other processes running that are spinning on a core to which you are affinitizing.
- Check what other processes on the host will use busy spinning and find out the cores they will use.
- In Linux, the OS often uses core 0 for some of its tasks, so it is better to avoid this core if possible.
- When feasible, it is best to disable hyper-threading to maximize the amount of CPU cache available to each CPU.
Basic Affinitization (Approach 1)
The basic affinitization approach requires no additional DDL configuration and simply uses the numactl command to restrict the NUMA memory nodes and logical CPUs on which your application can execute. Using this approach can be a good first step in evaluating the performance benefits of affinitizing your application, but it is not ideal for reducing jitter.
Pros
- Simple
- Avoids remote NUMA node access
Cons
- Does not prevent thread context switches; the OS is free to move threads between logical CPUs, which leads to jitter.
- Does not provide visibility into which CPU a thread is running on, making it harder to diagnose cases where two critical threads are scheduled on the same core.
Launching With Basic Affinitization
In the CPU layout determined above, one could launch an application with memory pinned to NUMA node 1, and only CPUs from socket 1 as follows:
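For example, with the layout shown above, memory and CPU execution can both be bound to node/socket 1 with numactl; the application launch command itself is a placeholder:

```
numactl --membind=1 --cpunodebind=1 java -jar my-application.jar
```

Here --membind restricts memory allocation to NUMA node 1 and --cpunodebind restricts execution to the CPUs of node 1.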
Refer to your numactl manual pages for more information.
Validating Basic Affinitization
With basic affinitization, it isn't straightforward to determine which CPUs a particular thread ends up running on, but you can use a command like top (pressing the '1' key after launching it to show per-CPU usage) to validate that all of your application threads are running on the expected CPUs. With enough effort, it may be possible to correlate the thread IDs displayed in a stack dump with those shown in a tool such as htop, but that is outside the scope of this document.
Advanced Affinitization (Approach 2)
For applications that are most concerned with reducing jitter, the basic affinitization approach described above still leaves open the potential for the operating system to relocate your threads from one CPU to another, which can lead to latency spikes. With the advanced affinitization approach described here, you avoid this by pinning each busy spinning or critical thread in the application to its own CPU to avoid context switching.
CPU Affinity Mask Format
Thread affinities are configured by supplying a mask that indicates the CPUs on which a thread can run. The mask can either be a long bit mask of logical CPUs, or a square-bracket-enclosed, comma-separated list enumerating the logical CPUs to which a thread should be affinitized. The latter format is recommended as it is easier to read.
Examples:
- "0" no affinity specified (0x0000)
- "[]" no affinity specified
- "1" specifies logical CPU 0 (0x0001)
- "[0]" specifies logical CPU 0
- "4" specifies logical CPU 2 (0x0100)
- "[2]" list specifying logical CPU 2
- "6" mask specifying logical CPU 1 and 2 (0x0110)
- "4294967296" specifies logical CPU 32 (0x1000 0000 0000 0000 0000 0000 0000 0000)
- "[32]" specifies logical CPU 32
- "[1,2]" list specifying logical CPU 1 and 2
Enabling Affinitization
By default, CPU affinitization is disabled. To enable it, set the enabling env flag in the DDL configuration.
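A minimal sketch of such a DDL env section is shown below. The property name nv.enablecpuaffinitymasks is an assumption; consult your platform's configuration reference for the exact flag name.

```xml
<env>
    <!-- Assumed property name: enables application of CPU affinity masks -->
    <nv.enablecpuaffinitymasks>true</nv.enablecpuaffinitymasks>
</env>
```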
Configuring CPU Affinities
Step 1: Configure Default Cpu Affinity Mask
Threads that are critical for reducing application latency and improving throughput are listed below, but not all threads are critical. To prevent non-critical threads from being scheduled on a CPU being used by a critical thread, the platform allows the application to configure one or more 'default' CPUs that non-critical threads can be affinitized to, by setting the 'nv.defaultcpuaffinitymask' environment variable. For example, the platform's statistics collection thread doesn't need its own dedicated CPU to perform its relatively simple tasks of periodically reporting heartbeats. However, we still want to ensure that the operating system doesn't try to schedule it onto the same core as a critical thread, so the platform will affinitize it with the default CPU affinity mask.
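For example, with the CPU layout shown earlier, non-critical threads could be restricted to two logical CPUs on socket 1. The property name comes from this document; the surrounding env element structure is a sketch.

```xml
<env>
    <!-- Non-critical threads may run on logical CPUs 9 and 11 (both on socket 1) -->
    <nv.defaultcpuaffinitymask>[9,11]</nv.defaultcpuaffinitymask>
</env>
```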
Step 2: Configure Critical Platform Threads Affinities
Critical platform-related threads are those that have the most impact on latency and performance. When the platform is optimized for latency or throughput, these threads will be set to use BusySpinning or Yielding respectively to avoid being context switched. Each of these threads should be assigned its own CPU.
See Appendix: Configuring Critical Thread Affinities, for a listing of these threads and how to configure their affinities.
Step 3: Affinitizing Non-Platform Threads
If your application uses its own threads, they can be affinitized as well by using the platform's UtlThread utility class. Non-critical threads that are not busy spinning should be affinitized to the default cores, and critical or busy threads should be pinned to their own core to prevent them from being scheduled on top of the platform's threads.
Non-Critical, Non-Spinning Thread
Non-critical threads can be affinitized to the set of default CPUs configured by nv.defaultcpuaffinitymask by calling setDefaultCpuAffinityMask from the thread to be affinitized.
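A minimal sketch is shown below; the UtlThread package name and the exact signature of setDefaultCpuAffinityMask are assumptions, so check the platform API docs before relying on them.

```java
import com.neeve.util.UtlThread; // package name assumed

Thread housekeeping = new Thread(() -> {
    // Affinitize this non-critical thread to the default CPU set configured
    // via nv.defaultcpuaffinitymask (method name from this document; the
    // no-argument form is an assumption).
    UtlThread.setDefaultCpuAffinityMask();
    // ... periodic, non latency-critical work ...
}, "housekeeping");
housekeeping.start();
```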
Critical or Busy Threads
Threads that participate in your transaction's processing flow, or that spin or are heavy CPU users, should be pinned to their own core so that they don't interfere with affinitized platform threads. For example:
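The following is a sketch of pinning an application-owned busy spinning thread to its own core using the list mask format described earlier; the UtlThread method name shown is an assumption.

```java
import com.neeve.util.UtlThread; // package name assumed

Thread feedHandler = new Thread(() -> {
    // Pin this busy spinning thread to logical CPU 5 so the OS won't migrate
    // it and it won't land on a core used by an affinitized platform thread.
    // Method name is an assumption; "[5]" is the list mask format from above.
    UtlThread.setCPUAffinityMask("[5]");
    while (!Thread.currentThread().isInterrupted()) {
        // ... busy spin polling for work ...
    }
}, "feed-handler");
feedHandler.start();
```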
Launching with NUMA Affinitization
Unlike with the Basic Affinitization approach, once all threads have been affinitized to their own core or to the default cores, it is not strictly necessary to restrict the cores on which the process can run; only the memory node needs to be restricted. In fact, it can even be beneficial to let threads outside the platform's or application's control be scheduled on other NUMA nodes.
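For example, to restrict only the memory node while leaving thread placement to the affinity masks configured above (the launch command is a placeholder):

```
numactl --membind=1 java -jar my-application.jar
```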
Validating Affinitization
Via Thread Stats Output
The easiest way to check your work is to enable XVM thread stats. Thread stats are emitted in heartbeats, and affinities can be reported in tools like Lumino. If the XVM is configured to trace thread stats, then thread usage is printed in the trace output.
You can look for any spinning thread (CPU% at 100) that doesn't have an affinity assigned. This will help you avoid the following pitfalls:
- Having two threads spinning on the same coreId will make performance worse (either on the same coreId but a different threadId, or worse, on the same coreId/threadId).
- Having some other non-X process spinning on one of the coreIds that you've affinitized to.
- Affinitizing across multiple socketIds (which are on different NUMA nodes) can make performance worse.
- Your maximum heap will be limited to the amount of physical memory in the memory bank of the NUMA node to which you are pinning.
The platform outputs thread affinitization using a format like aff=[6(s0c9t0)], which can be interpreted as logical CPU 6, which is on socket 0, core 9, thread 0.
Programmatically
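You can trigger the dump with a call along the following lines (the method name is illustrative, not confirmed API):

```java
// Illustrative only: the exact UtlThread method for dumping the
// affinitization state of registered threads is assumed here.
UtlThread.dumpAffinities();
```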
This will dump the affinitization state of all threads affinitized through UtlThread.
Via Trace
The above trace will also be printed by an AepEngine after messaging has been started or alternatively when it assumes a backup role (in most cases all platform threads will have been started by this time).
Limitations
The following limitations apply to thread affinitization support:
- Thread affinitization is currently only supported on Linux.
- Affinitization is limited to logical CPUs 0 through 63.
- Affinitizing a thread does not reserve the CPU core; it only limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the OS thread scheduler may schedule another thread on top of a critical thread when CPU resources are scarce.
Appendix: Configuring Critical Thread Affinities
Threads that can/should be affinitized include the following:
Thread | Description |
---|---|
Engine Input Multiplexer | The engine thread that dequeues and dispatches application messages and events. This is the main application thread. The detached threads described below can offload work from this thread, which can improve throughput and latencies in your application. |
Bus Detached Sender Thread | Each bus configured for your application can optionally be configured to send committed outbound messages on a detached thread. When the bus is configured for detached send, this thread offloads the work of serializing and writing outbound messages from the engine's input multiplexer, which serves as a buffer against spikes caused by message bus flow control. |
Store Reader Thread | The I/O thread for the store. On a primary instance, this is the thread that dispatches store acknowledgements back into the engine. On a backup, this is the thread that dispatches received replication traffic. |
Store Detached Send Thread | When the store is configured for detached send, this thread offloads the work of writing recovery data to the network for backup instances from the engine's input multiplexer, which can serve as a buffer against network I/O spikes. |
Store Detached Dispatch Thread | When the store is configured for detached dispatch, this thread allows the store reader thread to offload the work of dispatching deserialized replication traffic to the engine for processing. This is useful in cases where the cost of deserializing replication traffic is high. |
Store Detached Persister | When the store is configured for detached persistence, this thread offloads the work of writing recovery data to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Store Detached ICR | When the store is configured for detached Inter Cluster Replication, this thread offloads the work of writing recovery data to the ICR bus to insulate the engine's input multiplexer from spikes caused by flow control on the ICR bus. |
Detached Inbound Message Logger | When the application is configured with a detached inbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Detached Outbound Message Logger | When the application is configured with a detached outbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Bus Specific Threads
In addition to the core threads above, some bus bindings also support additional threads which may be affinitized.
Solace
Thread | Property Name | Description |
---|---|---|
Detached Dispatch Thread | dispatcher_cpu_affinity_mask | Configures the CPU affinity mask for the detached dispatch thread which, when enabled, offloads the work of deserializing messages and dispatching them from the Solace thread reading messages from the wire. |
Solace Consumer Session Dispatch Thread | consumer_cpu_affinity_mask | This property attempts to affinitize the Solace client library's receiver thread for this connection's consumer session. |
Solace Producer Session Dispatch Thread | producer_cpu_affinity_mask | This property attempts to affinitize the Solace client library's thread for this connection's producer session. |
Detached Send Thread | detached_sends_cpu_affinity_mask | Configures the CPU affinity mask for the detached sender thread. |
Below is an example of configuring affinities for a Solace binding.
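As a sketch, the properties above could be supplied on the Solace bus descriptor as query parameters; the host, port, and descriptor form shown are illustrative, while the property names come from the table above.

```
solace://mysolacehost:55555?dispatcher_cpu_affinity_mask=[3]&consumer_cpu_affinity_mask=[5]&producer_cpu_affinity_mask=[7]&detached_sends_cpu_affinity_mask=[9]
```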
Direct Binding
The direct binding allows applications to connect directly to an application hosted in the XVM. When using the direct binding, it is possible to set up a dedicated IOThread for accepting and servicing direct connections. The IOThread can be affinitized using the affinity attribute:
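A sketch of such configuration is shown below; apart from the affinity attribute named in this document, the element names and descriptor are illustrative.

```xml
<!-- Illustrative sketch: only the 'affinity' attribute is taken from this
     document; element names and the acceptor descriptor are assumptions. -->
<acceptor descriptor="direct://0.0.0.0:12000">
    <ioThread affinity="[7]"/>
</acceptor>
```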