Overview
To achieve the lowest possible latency and best throughput with minimal jitter, the platform supports pinning critical threads to individual CPU cores. Generally speaking, an application will achieve its best performance when:
- All application threads can be affinitized to the same processor socket (sharing the same L3 cache),
- Memory for the process is affinitized to the above socket (to avoid remote NUMA node access),
- Critical threads are pinned to their own CPU core and are set to busy-spin rather than sleep or block (to avoid context switches),
- Hyperthreading is disabled (to avoid threads being scheduled onto the same physical core as a busy thread).
Benefits of thread affinitization
Pinning a thread to a particular CPU ensures that the OS won't reschedule the thread onto another core, incurring a context switch that would force the thread to reload its working state from memory and introduce jitter. When all critical threads in the processing pipeline are pinned to their own CPUs and busy-spinning, the OS scheduler is less likely to schedule another thread onto those cores, keeping each thread's processor caches hot.
NUMA Considerations
As an in-memory computing platform, an X application's primary storage mechanism is memory. Consequently, from a von Neumann bottleneck perspective, the platform operates best when the processor is as close to its in-memory state as possible. On NUMA (Non-Uniform Memory Access) machines, this means binding the memory an application uses to the node of the processor on which it runs.
The goal of NUMA pinning is to ensure that processor threads are operating on memory that is physically closest to them:
- Making sure threads execute on a core close to the memory allocated to the process decreases the time it takes a thread to access state from main memory.
- When two threads collaborate within the process, for example passing a message from one thread to another, pinning ensures the minimum memory distance for the receiving thread.
Hyper Threading Considerations
When a machine has enough CPU capacity, it is generally best to disable hyper-threading. When hyper-threading is enabled, the logical CPUs on a core share that physical core's caches, which means less cache space is available per logical processor.
Configuring Affinitization
To effectively optimize an application via affinitization, you need to determine the target machine's CPU layout, enable affinitization, and then configure affinity masks for the application's critical threads. These steps are covered below.
Determining CPU Layout
The CPU layout is machine dependent. Before configuring CPU affinity masks, it is necessary to determine the CPU layout on the target machine. Talon includes a utility class, UtlThread, that can be run to assist with this:
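For example, with the platform jars on the classpath (the package name and the assumption that UtlThread can be launched directly as a main class should be verified against your platform version's UtlThread documentation):

```
# Sketch only: package and main entry point assumed -- check the UtlThread javadoc.
java -cp <platform classpath> com.neeve.util.UtlThread
```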
which will produce output similar to the following:
0: CpuInfo{socketId=0, coreId=0, threadId=0}
1: CpuInfo{socketId=1, coreId=0, threadId=0}
2: CpuInfo{socketId=0, coreId=8, threadId=0}
3: CpuInfo{socketId=1, coreId=8, threadId=0}
4: CpuInfo{socketId=0, coreId=2, threadId=0}
5: CpuInfo{socketId=1, coreId=2, threadId=0}
6: CpuInfo{socketId=0, coreId=10, threadId=0}
7: CpuInfo{socketId=1, coreId=10, threadId=0}
8: CpuInfo{socketId=0, coreId=1, threadId=0}
9: CpuInfo{socketId=1, coreId=1, threadId=0}
10: CpuInfo{socketId=0, coreId=9, threadId=0}
11: CpuInfo{socketId=1, coreId=9, threadId=0}
12: CpuInfo{socketId=0, coreId=0, threadId=1}
13: CpuInfo{socketId=1, coreId=0, threadId=1}
14: CpuInfo{socketId=0, coreId=8, threadId=1}
15: CpuInfo{socketId=1, coreId=8, threadId=1}
16: CpuInfo{socketId=0, coreId=2, threadId=1}
17: CpuInfo{socketId=1, coreId=2, threadId=1}
18: CpuInfo{socketId=0, coreId=10, threadId=1}
19: CpuInfo{socketId=1, coreId=10, threadId=1}
20: CpuInfo{socketId=0, coreId=1, threadId=1}
21: CpuInfo{socketId=1, coreId=1, threadId=1}
22: CpuInfo{socketId=0, coreId=9, threadId=1}
23: CpuInfo{socketId=1, coreId=9, threadId=1}
In the above, we can see:
- The machine has 24 logical CPUs (0 through 23).
- There are two processor sockets (socketId=0 and socketId=1).
- There are 12 physical cores in total: 6 physical cores per socket (coreIds 0, 1, 2, 8, 9 and 10).
- Hyper-threading is enabled, with two hardware threads per core (threadId=0 and threadId=1).
Configuring CPU Affinities
CPU Affinity Mask Format
Thread affinities are configured by supplying a mask that indicates the cores on which a thread can run. The mask can either be a long bit mask of logical CPUs, or a square-bracketed, comma-separated list enumerating the logical CPUs to which a thread should be affinitized. The latter format is recommended as it is easier to read (the equivalence between the two forms is illustrated in the sketch after the examples below).
Examples:
- "0" no affinity specified (0x0000)
- "[]" no affinity specified
- "1" specifies logical cpu 0 (0x0001)
- "[0]" specifies logical cpu 0
- "4" specifies logical cpu 2 (0x0100)
- "[2]" list specifying logical cpu 2
- "6" mask specifying logical cpu 1 and 2 (0x0110)
- "4294967296" specifies logical cpu 32 (0x1000 0000 0000 0000 0000 0000 0000 0000)
- "[32]" specifies logical cpu 32
- "[1,2]" list specifying logical cpu 1 and 2
Enabling Affinitization
By default, CPU affinitization is disabled. To enable it, set the following env properties in the DDL configuration:
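For example, in the DDL's env section (a sketch only: nv.defaultcpuaffinitymask is referenced later in this document, but the name of the enablement property shown here, nv.enablecpuaffinitymasks, and the exact env syntax should be verified against your platform version's settings reference):

```xml
<env>
    <!-- Sketch: verify property names and env syntax for your platform version. -->
    <!-- Globally enables CPU affinitization (property name assumed): -->
    <nv.enablecpuaffinitymasks>true</nv.enablecpuaffinitymasks>
    <!-- Default mask applied to threads not given their own affinity: -->
    <nv.defaultcpuaffinitymask>[0,1]</nv.defaultcpuaffinitymask>
</env>
```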
Affinitizing Critical Threads
Threads that can/should be affinitized include the following:
Thread | Description |
---|---|
Engine Input Multiplexer | The engine thread that dequeues and dispatches application messages and events. This is the main application thread. The detached threads described below can offload work from it, which can improve throughput and latency in your application. |
Bus Detached Sender Thread | Each bus configured for your application can optionally be configured to send committed outbound messages on a detached thread. When the bus is configured for detached send, this thread offloads the work of serializing and writing outbound messages from the engine's input multiplexer, which serves as a buffer against spikes caused by message bus flow control. |
Store Reader Thread | The I/O thread for the store. On a primary instance this is the thread that dispatches store acknowledgements back into the engine. On a backup this is the thread that dispatches received replication traffic. |
Store Detached Persister | When the store is configured for detached persistence, this thread offloads the work of writing recovery data to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Store Detached Send Thread | When the store is configured for detached send, this thread offloads the work of writing recovery data to the network for backup instances from the engine's input multiplexer, which can serve as a buffer against network I/O spikes. |
Store Detached Dispatch Thread | When the store is configured for detached dispatch, this thread allows the store reader thread to offload the work of dispatching deserialized replication traffic to the engine for processing. This is useful in cases where the cost of deserializing replication traffic is high. |
Detached Inbound Message Logger | When the application is configured with a detached inbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Detached Outbound Message Logger | When the application is configured with a detached outbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes. |
Bus Specific Threads
In addition to the core threads above, some bus bindings also support additional threads that may be affinitized.
Solace
Thread | Property Name | Description |
---|---|---|
Detached Dispatch Thread | dispatcher_cpu_affinity_mask | Configures the CPU affinity mask for the detached dispatch thread, which offloads the work of deserializing messages and dispatching them from the Solace thread that reads messages from the wire. |
Solace Consumer Session Dispatch Thread | consumer_cpu_affinity_mask | Attempts to affinitize the Solace client library's receiver thread for this connection's consumer session. |
Solace Producer Session Dispatch Thread | producer_cpu_affinity_mask | Attempts to affinitize the Solace client library's thread for this connection's producer session. |
Detached Send Thread | detached_sends_cpu_affinity_mask | Configures the CPU affinity mask for the detached sender thread. |
Example of configuring affinities for a Solace binding:
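The sketch below is illustrative only: the broker host, port and mask values are placeholders, and the query-string style of supplying binding properties on the bus descriptor is an assumption to verify for your platform version; the property names themselves are those listed in the table above.

```
solace://mybroker:55555&dispatcher_cpu_affinity_mask=[4]&consumer_cpu_affinity_mask=[5]&producer_cpu_affinity_mask=[6]&detached_sends_cpu_affinity_mask=[7]
```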
Direct Binding
The direct binding allows applications to connect directly to an application hosted in an XVM. In this case, the same guidance applies: affinitize the binding's critical threads to their own cores on the same socket as the rest of the application.
Affinitizing Non Platform Threads
If your application uses its own threads, they can be affinitized as well using the platform's UtlThread utility class. Non-critical threads that are not busy-spinning should be affinitized to the default cores, while critical or busy-spinning threads should be pinned to their own cores to prevent them from being scheduled on top of other threads.
Non Critical, Non Spinning Thread
Non-critical threads can be affinitized to the set of default CPUs configured by nv.defaultcpuaffinitymask.
Critical or busy threads should be pinned to their own core:
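For example (a sketch only: UtlThread's package, com.neeve.util, and the setCPUAffinityMask(String) method are assumptions to verify against your platform version's UtlThread javadoc):

```java
// Sketch: UtlThread package and method name assumed -- see your version's javadoc.
import com.neeve.util.UtlThread;

public class AffinityExamples {
    public static void main(String[] args) {
        // Non-critical worker: restrict it to the default core set configured
        // via nv.defaultcpuaffinitymask (falling back to "[0,1]" for this sketch).
        Thread housekeeping = new Thread(() -> {
            UtlThread.setCPUAffinityMask(System.getProperty("nv.defaultcpuaffinitymask", "[0,1]"));
            // ... periodic, non-latency-critical work ...
        }, "housekeeping");

        // Critical, busy-spinning worker: pin it to its own core (logical CPU 4 here)
        // so it is never migrated and no other affinitized thread shares its core.
        Thread spinner = new Thread(() -> {
            UtlThread.setCPUAffinityMask("[4]");
            while (!Thread.currentThread().isInterrupted()) {
                // ... busy-spin polling loop ...
            }
        }, "critical-spinner");

        housekeeping.start();
        spinner.start();
    }
}
```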
Best Practices
- Before launching your process, validate that there aren't other processes running that are spinning on a core to which you are affinitizing.
- Check which other processes on the host will use busy spinning and find out the cores they will use.
- In Linux, the OS often uses core 0 for some of its tasks, so it is better to avoid this core if possible.
Launching with NUMA Affinitization
To affinitize the XVM process to a particular socket, you can launch the XVM using numactl. For example, if you affinitize threads to cores on socket 0, it is best to allocate memory only from NUMA node 0:
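A minimal example (the XVM launch command itself is a placeholder; numactl's --cpunodebind and --membind options restrict the process to the CPUs and memory of the given NUMA node):

```
numactl --cpunodebind=0 --membind=0 <your normal XVM launch command>
```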
Validating Affinitization
The easiest way to check your work is to enable server thread stats trace, which outputs per-thread usage, including CPU time and configured affinities.
You can look for any spinning thread (CPU% at 100) that doesn't have an affinity assigned. This will help you avoid the following pitfalls:
- Having two threads spinning on the same coreId, which will make performance worse (either on the same coreId with a different threadId, or, worse still, on the same coreId/threadId).
- Having some other non-X process spinning on one of the coreIds to which you've affinitized.
- Affinitizing across multiple socketIds (which are on different NUMA nodes), which can make performance worse.
- Note that your maximum heap will be limited to the amount of physical memory in the memory bank of the NUMA node to which you are pinning.
Programmatically, you can also make a call such as the following:
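A sketch only: the method name below is hypothetical; consult your platform version's UtlThread javadoc for the actual call.

```java
// Hypothetical method name -- check UtlThread's javadoc for the real API:
com.neeve.util.UtlThread.dumpAffinitizedThreads();
```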
...which will dump the affinitization state of all threads affinitized through UtlThread.
This affinity dump will also be printed by an AepEngine after it is started when the nv.aep trace level is set to "config" or higher, and it can also be invoked as an XVM command via Robin or Lumino.
Limitations
The following limitations apply to thread affinitization support:
- Thread affinitization is currently only supported on Linux.
- Threads can only be affinitized to logical CPUs 0 through 63.
- Affinitizing a thread does not reserve the CPU core; it only limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the OS thread scheduler may schedule another thread on top of a critical thread when CPU resources are scarce.