Overview
To achieve the lowest possible latency and best throughput with minimal jitter, the platform supports pinning critical threads to individual cpu cores. Generally speaking, the best performance for an application will be achieved when:
- All application threads can be affinitized to the same processor socket (sharing the same L3 cache),
- Memory for the process is affinitized to the above socket (to avoid remote numa node access),
- Critical threads are pinned to their own cpu core and are set to busy spin rather than sleep or block (to avoid context switches),
- Hyperthreading is disabled (to avoid threads being scheduled onto the same physical core as a busy thread).
Benefits of thread affinitization
Pinning a thread to a particular cpu ensures that the os won't reschedule the thread onto another core, which would incur a context switch, force the thread to reload its working state from memory, and introduce jitter. When all critical threads in the processing pipeline are pinned to their own cpus and busy spinning, the os scheduler is less likely to schedule other threads onto those cores, keeping each thread's processor caches hot.
NUMA Considerations
As an in-memory computing platform, X uses memory as its primary storage mechanism. Consequently, from a von Neumann bottleneck perspective the platform will operate best when the processor is as close to its in-memory state as possible. On NUMA (Non-Uniform Memory Access) machines, this means pinning the memory used by the application to the same socket as the cores on which its threads run.
The goal of NUMA pinning is to ensure that processor threads are operating on memory that is physically closest to them:
- Making sure threads execute on a core closest to the memory allocated to the process decreases the time it takes a thread to access state from main memory.
- When two threads collaborate within the process, for example passing a message from one thread to another, pinning ensures the minimum memory distance for the receiving thread.
Hyper Threading Considerations
When a machine has enough cpu capacity, it is generally best to disable hyperthreading. When hyperthreading is enabled, each logical cpu shares the physical core's caches, which means that there is less cache space available per logical cpu.
Configuring Affinitization
To effectively optimize an application via affinitization, first determine the cpu layout of the target machine, then configure cpu affinity masks for the application's threads, and finally pin the process memory to the matching NUMA node.
Determining CPU Layout
The cpu layout is machine dependent, so before configuring cpu affinity masks it is necessary to determine the cpu layout on the target machine. Talon includes a utility class, UtlThread, that can be run to assist with this; running it produces output similar to the following:
0: CpuInfo{socketId=0, coreId=0, threadId=0}
1: CpuInfo{socketId=1, coreId=0, threadId=0}
2: CpuInfo{socketId=0, coreId=8, threadId=0}
3: CpuInfo{socketId=1, coreId=8, threadId=0}
4: CpuInfo{socketId=0, coreId=2, threadId=0}
5: CpuInfo{socketId=1, coreId=2, threadId=0}
6: CpuInfo{socketId=0, coreId=10, threadId=0}
7: CpuInfo{socketId=1, coreId=10, threadId=0}
8: CpuInfo{socketId=0, coreId=1, threadId=0}
9: CpuInfo{socketId=1, coreId=1, threadId=0}
10: CpuInfo{socketId=0, coreId=9, threadId=0}
11: CpuInfo{socketId=1, coreId=9, threadId=0}
12: CpuInfo{socketId=0, coreId=0, threadId=1}
13: CpuInfo{socketId=1, coreId=0, threadId=1}
14: CpuInfo{socketId=0, coreId=8, threadId=1}
15: CpuInfo{socketId=1, coreId=8, threadId=1}
16: CpuInfo{socketId=0, coreId=2, threadId=1}
17: CpuInfo{socketId=1, coreId=2, threadId=1}
18: CpuInfo{socketId=0, coreId=10, threadId=1}
19: CpuInfo{socketId=1, coreId=10, threadId=1}
20: CpuInfo{socketId=0, coreId=1, threadId=1}
21: CpuInfo{socketId=1, coreId=1, threadId=1}
22: CpuInfo{socketId=0, coreId=9, threadId=1}
23: CpuInfo{socketId=1, coreId=9, threadId=1}
In the above we can see:
- The machine has 24 logical cpus (0 through 23)
- There are two processor sockets (socketId=0 and socketId=1)
- There are 12 physical cores in total, 6 physical cores per socket (coreIds 0, 1, 2, 8, 9 and 10)
- Hyperthreading is enabled and there are two threads per core (threadId=0 and threadId=1)
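For hosts where the platform utility is not handy, the same socketId/coreId information can also be read directly from Linux sysfs. The snippet below is a purely illustrative, standalone sketch (it is not the UtlThread implementation, and it prints socketId/coreId only):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Purely illustrative sketch: reads the Linux sysfs topology files to print the
// socketId/coreId of each logical cpu. This is NOT the platform's UtlThread utility.
public class PrintCpuLayout {
    public static void main(String[] args) throws Exception {
        List<Path> cpus = new ArrayList<>();
        try (DirectoryStream<Path> dir =
                 Files.newDirectoryStream(Paths.get("/sys/devices/system/cpu"), "cpu[0-9]*")) {
            for (Path p : dir) {
                cpus.add(p);
            }
        }
        // Sort numerically by logical cpu id (cpu0, cpu1, ...).
        cpus.sort(Comparator.comparingInt((Path p) ->
                Integer.parseInt(p.getFileName().toString().substring(3))));
        for (Path cpu : cpus) {
            String logicalId = cpu.getFileName().toString().substring(3);
            String socketId = Files.readString(cpu.resolve("topology/physical_package_id")).trim();
            String coreId = Files.readString(cpu.resolve("topology/core_id")).trim();
            // Hyperthread siblings appear as two logical cpus with the same socketId/coreId.
            System.out.println(logicalId + ": socketId=" + socketId + ", coreId=" + coreId);
        }
    }
}
```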
Configuring CPU Affinities
CPU Affinity Mask Format
Thread affinities are configured by supplying a mask that indicates the cores on which a thread can run. The mask can be either a long bit mask of logical cpus or a square-bracket-enclosed, comma-separated list enumerating the logical cpus to which a thread should be affinitized. The latter format is recommended as it is easier to read.
Examples:
- "0" no affinity specified (0x0000)
- "[]" no affinity specified
- "1" specifies logical cpu 0 (0x0001)
- "[0]" specifies logical cpu 0
- "4" specifies logical cpu 2 (0x0100)
- "[2]" list specifying logical cpu 2
- "6" mask specifying logical cpu 1 and 2 (0x0110)
- "4294967296" specifies logical cpu 32 (0x1000 0000 0000 0000 0000 0000 0000 0000)
- "[32]" specifies logical cpu 32
- "[1,2]" list specifying logical cpu 1 and 2
Enabling Affinitization
By default, cpu affinitization is disabled. To enable it, set the appropriate env flags in the DDL configuration.
Affinitizing Critical Threads
Affinitizing Non Platform Threads
If your application uses its own threads, they can be affinitized as well using the platform's UtlThread utility class. Non-critical threads that do not busy spin should be affinitized to the default cores, and critical or busy threads should be pinned to their own core, as sketched below.
Non Critical, Non Spinning Thread
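A sketch of what a non-critical, non-spinning thread might look like. This assumes UtlThread exposes a setCPUAffinityMask(String) style method, and the "[0,12]" default core list is hypothetical; neither is confirmed by this excerpt, so consult the UtlThread javadoc for the actual API:

```java
// Sketch only: the UtlThread.setCPUAffinityMask(String) call and the "[0,12]" default
// core list are assumptions for illustration; consult the UtlThread javadoc.
Runnable backgroundTask = () -> {
    UtlThread.setCPUAffinityMask("[0,12]"); // allow only the default, non-critical cores
    // ... normal work loop; this thread may sleep or block, so no dedicated core is needed ...
};
new Thread(backgroundTask, "background-worker").start();
```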
Critical or Busy Thread
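And a corresponding sketch for a critical, busy-spinning thread (again assuming a setCPUAffinityMask(String) style method; the chosen core is illustrative):

```java
// Sketch only: UtlThread.setCPUAffinityMask(String) is assumed; logical cpu 2 is illustrative.
Runnable criticalLoop = () -> {
    UtlThread.setCPUAffinityMask("[2]"); // pin to a dedicated core
    while (!Thread.currentThread().isInterrupted()) {
        // poll for work here; spin rather than block so the core (and its caches) stay hot
        Thread.onSpinWait();
    }
};
new Thread(criticalLoop, "critical-worker").start();
```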
Best Practices
- Before launching your process, validate that there aren't other processes running that are spinning on a core to which you are affinitizing.
- Check what other processes on the host will use busy spinning and find out the cores they will use.
- In Linux, the OS often uses core 0 for some of its tasks, so it is better to avoid this core if possible.
Configuring NUMA Affinitization
To affinitize the XVM process to a particular socket, you can launch the XVM using numactl. For example, if you affinitize threads to cores on socket 0, it is best to allocate memory only from numa node 0:
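The XVM launch command itself is omitted here; assuming the process is started from a shell, a typical invocation binds both the cpus and the memory of the process to node 0:

```
numactl --cpunodebind=0 --membind=0 <command that launches the XVM>
```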
Validating Affinitization
On Linux, invoke the `top` command and press 1 to verify that no cores outside of the masks you have configured are spinning. Watch out for the following pitfalls:
- Having two threads spinning on the same coreId will make performance worse (whether on the same coreId with different threadIds, or worse, on the same coreId/threadId).
- Having some other non-X process spinning on one of the coreIds to which you have affinitized.
- Affinitizing across multiple socketIds (which are on different numa nodes) can make performance worse.
- Your max heap will be limited to the amount of physical memory in the bank of the NUMA node to which you are pinning.
Limitations
The following limitations apply to thread affinitization support:
- Thread affinitization is currently only supported on Linux.
- Affinitization is limited to logical cpus 0 through 63.
- Affinitizing a thread does not reserve the cpu core; it just limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the os thread scheduler may schedule another thread on top of a critical thread when cpu resources are scarce.