The Talon Manual

Overview

To achieve the lowest possible latency and best throughput with minimal jitter, the platform supports pinning critical threads to individual CPU cores. Generally speaking, the best performance for an application will be achieved when:

  • All application threads can be affinitized to the same processor socket (sharing the same L3 cache),
  • Memory for the process is affinitized to the above socket (to avoid remote NUMA node access),
  • Critical threads are pinned to their own CPU core and are set to busy spin rather than sleep or block (to avoid context switches),
  • Hyperthreading is disabled (to avoid threads being scheduled onto the same physical core as a busy thread).

Benefits of thread affinitization

Pinning a thread to a particular CPU core ensures that the OS won't reschedule the thread to another core and incur a context switch that would force the thread to reload its working state from memory and cause jitter. When all critical threads in the processing pipeline are pinned to their own CPU and busy spinning, the OS scheduler is less likely to schedule another thread onto those cores, keeping each thread's processor caches hot.

NUMA Considerations

As an in-memory computing platform, X uses memory as an application's primary storage mechanism. Consequently, from a Von Neumann bottleneck perspective, the platform operates best when the processor is as close to its in-memory state as possible. On NUMA (Non-Uniform Memory Access) machines, this means pinning the application's memory to the closest processor.

The goal of NUMA pinning is to ensure that processor threads are operating on memory that is physically closest to them:

  • Making sure threads execute on a core closest to the memory allocated to the process decreases the time it takes a thread to access state in main memory.
  • When two threads are collaborating within the process, for example passing a message from one thread to another, pinning ensures the minimum memory distance for the receiving thread.

Hyper Threading Considerations

When a machine has enough CPU capacity, it is generally best to disable hyperthreading. When hyperthreading is enabled, each logical CPU shares the physical core's caches, which means there is less cache space available per logical processor.

Configuring Affinitization

To effectively optimize an application via affinitization, you first determine the CPU layout of the target machine, then enable affinitization and configure affinity masks for the application's critical threads. The sections below walk through each step.

Determining CPU Layout

The CPU layout is machine dependent. Before configuring CPU affinity masks, it is necessary to determine the CPU layout on the target machine. Talon includes a utility class, UtlThread, that can be run to assist with this.
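For example, the utility can be invoked as a standalone main class (the fully qualified class name below is an assumption based on Talon's typical com.neeve.util package layout; consult your installation's javadoc for the exact entry point):

java -cp <talon-classpath> com.neeve.util.UtlThread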

...which will produce output similar to the following:

0: CpuInfo{socketId=0, coreId=0, threadId=0}
1: CpuInfo{socketId=1, coreId=0, threadId=0}
2: CpuInfo{socketId=0, coreId=8, threadId=0}
3: CpuInfo{socketId=1, coreId=8, threadId=0}
4: CpuInfo{socketId=0, coreId=2, threadId=0}
5: CpuInfo{socketId=1, coreId=2, threadId=0}
6: CpuInfo{socketId=0, coreId=10, threadId=0}
7: CpuInfo{socketId=1, coreId=10, threadId=0}
8: CpuInfo{socketId=0, coreId=1, threadId=0}
9: CpuInfo{socketId=1, coreId=1, threadId=0}
10: CpuInfo{socketId=0, coreId=9, threadId=0}
11: CpuInfo{socketId=1, coreId=9, threadId=0}
12: CpuInfo{socketId=0, coreId=0, threadId=1}
13: CpuInfo{socketId=1, coreId=0, threadId=1}
14: CpuInfo{socketId=0, coreId=8, threadId=1}
15: CpuInfo{socketId=1, coreId=8, threadId=1}
16: CpuInfo{socketId=0, coreId=2, threadId=1}
17: CpuInfo{socketId=1, coreId=2, threadId=1}
18: CpuInfo{socketId=0, coreId=10, threadId=1}
19: CpuInfo{socketId=1, coreId=10, threadId=1}
20: CpuInfo{socketId=0, coreId=1, threadId=1}
21: CpuInfo{socketId=1, coreId=1, threadId=1}
22: CpuInfo{socketId=0, coreId=9, threadId=1}
23: CpuInfo{socketId=1, coreId=9, threadId=1}

In the above, we can see:

  • The machine has 24 logical CPUs (0 through 23).
  • There are two processor sockets (socketId=0 and socketId=1).
  • There are 12 physical cores in total, 6 physical cores per socket (coreIds 0, 1, 2, 8, 9 and 10).
  • Hyperthreading is enabled, with two hardware threads per core (threadId=0 and threadId=1).

Configuring CPU Affinities

CPU Affinity Mask Format

Thread affinities are configured by supplying a mask that indicates the cores on which a thread can run. The mask can either be a long bitmask of logical CPUs, or a square-bracketed, comma-separated list enumerating the logical CPUs to which the thread should be affinitized. The latter format is recommended as it is easier to read.

Examples:

  • "0" no affinity specified (0x0000)
  • "[]" no affinity specified
  • "1" specifies logical cpu 0 (0x0001)
  • "[0]" specifies logical cpu 0
  • "4" specifies logical cpu 2 (0x0100)
  • "[2]" list specifying logical cpu 2
  • "6" mask specifying logical cpu 1 and 2 (0x0110)
  • "4294967296" specifies logical cpu 32  (0x1000 0000 0000 0000 0000 0000 0000 0000)
  • "[32]" specifies logical cpu 32
  • "[1,2]" list specifying logical cpu 1 and 2 

 

Enabling Affinitization

By default, CPU affinitization is disabled. To enable it, you can set env flags in the DDL configuration.
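For example (nv.enablecpuaffinitymasks is an assumed name for the global enable switch; nv.defaultcpuaffinitymask is the default mask property referenced later in this document):

nv.enablecpuaffinitymasks=true
nv.defaultcpuaffinitymask=[0,1]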

Affinitizing Critical Threads

Threads that can/should be affinitized include the following:

Engine Input Multiplexer

The engine thread that dequeues and dispatches application messages and events. This is the main application thread.

The detached threads described below can offload work from this thread, which can improve throughput and latency in your application.

Bus Detached Sender Thread

Each bus configured for your application can optionally be configured to send committed outbound messages on a detached thread. When the bus is configured for detached send, this thread offloads the work of serializing and writing outbound messages from the engine's input multiplexer, and serves as a buffer against spikes caused by message bus flow control.

Tip: High values in the 'o2p', 's', 's2w' and 'ws' message bus binding stats are indicators that a detached bus sender can improve performance.

Store Reader Thread

The I/O thread for the store. On a primary instance this is the thread that dispatches store acknowledgements back into the engine. On a backup this is the thread that dispatches received replication traffic.

Store Detached Persister

When the store is configured for detached persistence, this thread offloads the work of writing recovery data to disk from the engine's input multiplexer, and can serve as a buffer against disk I/O spikes.

Tip: High values in the 'pers' (Persistence) store binding stats are an indicator that a detached store persister can improve performance.

Store Detached Send Thread

When the store is configured for detached send, this thread offloads the work of writing recovery data to the network for backup instances from the engine's input multiplexer, and can serve as a buffer against network I/O spikes.

Tip: High values in the 's2w' (Serialize To Wire) store binding stats are an indicator that a detached store sender can improve performance.

Store Detached Dispatch Thread

When the store is configured for detached dispatch, this thread allows the store reader thread to offload the work of dispatching deserialized replication traffic to the engine for processing. This is useful in cases where the cost of deserializing replication traffic is high.

Tip: A high value for the store binding deserialize stat ('d') can indicate that setting this property could improve throughput or latency.

Detached Inbound Message Logger

When the application is configured with a detached inbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, and can serve as a buffer against disk I/O spikes.

Tip: If your application's inbound message load is not high, a detached inbound message logger may not be needed. The tleg3 transaction latency statistic covers inbound message logging; high values or spikes are an indicator that a detached inbound message logger can help.
Detached Outbound Message Logger

When the application is configured with a detached outbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, and can serve as a buffer against disk I/O spikes.


Tip: If your application's outbound message load is not high, a detached outbound message logger may not be needed. The tleg3 transaction latency statistic covers outbound message logging; high values or spikes are an indicator that a detached outbound message logger can help.

Bus Specific Threads

In addition to the core threads above, some bus bindings also support additional threads which may be affinitized.

Solace

Detached Dispatch Thread

dispatcher_cpu_affinity_mask

When detached dispatch is enabled, a detached dispatch thread is created that offloads the work of deserializing and dispatching messages from the Solace thread that reads messages from the wire. This property configures the detached dispatch thread's CPU affinity mask.

Tip: For applications in which deserialization and dispatch cost is high, enabling detached dispatch can improve throughput and decrease latency.

Solace Consumer Session Dispatch Thread

consumer_cpu_affinity_mask

This property attempts to affinitize the Solace client library's receiver thread for this connection's consumer session.

Tip: For highly active sessions, the Solace receiver thread can consume a fair amount of CPU, and it is on the critical latency path.

Note: Because this thread is not under the platform's control, it is affinitized the first time it calls into the binding.

Solace Producer Session Dispatch Thread

producer_cpu_affinity_mask

This property attempts to affinitize the Solace client library's receiver thread for this connection's producer session.

Tip: Acknowledgements come in on the producer session. This thread is not on the critical latency path, so it doesn't always warrant a core of its own.

Note: Because this thread is not under the platform's control, it is affinitized the first time it calls into the binding.

Detached Send Thread

detached_sends_cpu_affinity_mask

Configures the CPU affinity mask for the detached sender thread.

Tip: Enabling the detached send thread is usually not needed for Talon applications because they can use the engine's bus detached send thread described in the previous section. This property may be useful for applications using the Solace binding outside of a Talon application.

Below is an example of configuring affinities for Solace bindings.
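The following sketch shows a Solace bus descriptor carrying the affinity properties above (the host, port and mask values are illustrative, and the exact descriptor syntax may vary by version):

solace://mysolacehost:55555&dispatcher_cpu_affinity_mask=[4]&consumer_cpu_affinity_mask=[5]&producer_cpu_affinity_mask=[0,1]&detached_sends_cpu_affinity_mask=[6]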

Direct Binding

The direct binding allows applications to connect directly to an application hosted in an XVM. In this case, it is recommended that the binding's threads be affinitized in the same fashion as the other critical threads described above.

Affinitizing Non Platform Threads

If your application uses its own threads, they can be affinitized as well using the platform's UtlThread utility class. Non-critical threads that are not busy spinning should be affinitized to the default cores, and critical or busy threads should be pinned to their own core to prevent them from being scheduled on top of other threads.

Non Critical, Non Spinning Thread

Non-critical threads can be affinitized to the set of default CPUs configured by nv.defaultcpuaffinitymask.
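A minimal sketch (UtlThread.setDefaultCPUAffinityMask() is an assumed helper that applies the mask configured by nv.defaultcpuaffinitymask to the calling thread; the UtlThread import is omitted since its package depends on your installation):

Thread housekeeping = new Thread(() -> {
    // Assumed helper: applies the nv.defaultcpuaffinitymask
    // mask to the calling thread.
    UtlThread.setDefaultCPUAffinityMask();
    // ... non-critical work ...
});
housekeeping.start();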

Critical or busy threads should be pinned to their own core:
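A minimal sketch, assuming UtlThread exposes a setCPUAffinityMask(String) method that accepts the mask formats described earlier (the method name is an assumption):

Thread spinner = new Thread(() -> {
    // Assumed method: pins the calling thread to logical CPU 4.
    UtlThread.setCPUAffinityMask("[4]");
    while (!Thread.currentThread().isInterrupted()) {
        // ... busy-spin work loop ...
    }
});
spinner.start();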

Best Practices

  • Before launching your process, validate that there aren’t other processes running that are spinning on a core to which you are affinitizing.

  • Check what other processes on the host will use busy spinning and find out the cores they will use.

  • In Linux, the OS often uses Core 0 for some of its tasks, so it is better to avoid this core if possible. 

Launching with NUMA Affinitization

To affinitize the XVM process to a particular socket, you can launch the XVM using numactl. For example, if you affinitize threads to cores on socket 0, it is best to allocate memory only from NUMA node 0.
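A sketch of such a launch (the java invocation is a placeholder for your actual XVM launch command; --cpunodebind and --membind restrict the process to node 0's cores and memory):

numactl --cpunodebind=0 --membind=0 java -cp <talon-classpath> <xvm-main-class>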

Validating Affinitization

The easiest way to check your work is to enable server thread stats trace, which outputs per-thread usage information such as CPU utilization and configured affinities.

You can look for any spinning thread (CPU% at 100) that doesn't have an affinity assigned. This will help you avoid the following pitfalls:

  • Having two threads spinning on the same coreId will make performance worse (either on the same coreId but different threadId, or worse, on the same coreId/threadId).

  • Having some other non-X process spinning on one of the coreIds that you've affinitized to.

  • Affinitizing across multiple socketIds (which are on different NUMA nodes) can make performance worse.

  • Your maximum heap will be limited to the amount of physical memory in the bank of the NUMA node to which you are pinning.


Programmatically, you can also call
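a dump helper on UtlThread, sketched below (the method name is an assumption; check UtlThread's javadoc for the actual name):

UtlThread.dumpAffinitizedThreads(); // assumed method name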

...which will dump the affinitization state of all threads affinitized through UtlThread.

The above trace will also be printed by an AepEngine after it is started when the nv.aep trace level is set to "config" or higher. The affinity dump can also be invoked as an XVM command via Robin or Lumino.

Limitations

The following limitations apply to thread affinitization support:

  • Thread affinitization is currently only supported on Linux.

  • Affinitization is limited to logical CPUs 0 through 63.

  • Affinitizing a thread does not reserve the CPU core; it only limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the OS thread scheduler may schedule another thread on top of a critical thread when CPU resources are scarce.

 
