Overview
To achieve the lowest possible latency and best throughput with minimal jitter, the platform supports pinning critical threads to individual cpu cores. Generally speaking, the best performance for an application will be achieved when:
- All application threads can be affinitized to the same processor socket (sharing the same L3 cache),
- Memory for the process is affinitized to the above socket (to avoid remote numa node access),
- Critical threads are pinned to their own cpu core and are set to busy spin rather than sleep or block (to avoid context switches),
- Hyperthreading is disabled (to avoid threads being scheduled onto the same physical core as a busy thread).
Benefits of thread affinitization
Pinning a thread to a particular cpu ensures that the os won't reschedule the thread onto another core, which would incur a context switch, force the thread to reload its working state from memory, and introduce jitter. When all critical threads in the processing pipeline are pinned to their own cpus and busy spinning, the os scheduler is less likely to schedule other threads onto those cores, keeping each thread's processor caches hot.
NUMA Considerations
As an in-memory computing platform, X uses memory as its primary storage mechanism. Consequently, from a von Neumann bottleneck perspective the platform will operate best when the processor is as close to its in-memory state as possible. On NUMA (Non-Uniform Memory Access) machines, this means pinning the memory used by the application to the same socket as the cores on which its threads run.
The goal of NUMA pinning is to ensure that processor threads are operating on memory that is physically closest to them:
- Making sure threads execute on a core closest to the memory allocated to the process decreases the time it takes a thread to access state from main memory.
- When two threads collaborate within the process, for example passing a message from one thread to another, pinning ensures the minimum memory distance for the receiving thread.
Hyper Threading Considerations
When a machine has enough cpu capacity, it is generally best to disable hyperthreading. When hyperthreading is enabled, each logical cpu shares the physical core's caches, which means that there is less cache space available per logical cpu.
Configuring Affinitization
To effectively optimize an application via affinitization, first determine the cpu layout of the target machine, then configure cpu affinity masks for the application's threads, and finally pin the process memory to the matching NUMA node.
Determining CPU Layout
The cpu layout is machine dependent, so before configuring cpu affinity masks it is necessary to determine the cpu layout on the target machine. Talon includes a utility class, UtlThread, that can be run to assist with this; running it produces output similar to the following:
0: CpuInfo{socketId=0, coreId=0, threadId=0}
1: CpuInfo{socketId=1, coreId=0, threadId=0}
2: CpuInfo{socketId=0, coreId=8, threadId=0}
3: CpuInfo{socketId=1, coreId=8, threadId=0}
4: CpuInfo{socketId=0, coreId=2, threadId=0}
5: CpuInfo{socketId=1, coreId=2, threadId=0}
6: CpuInfo{socketId=0, coreId=10, threadId=0}
7: CpuInfo{socketId=1, coreId=10, threadId=0}
8: CpuInfo{socketId=0, coreId=1, threadId=0}
9: CpuInfo{socketId=1, coreId=1, threadId=0}
10: CpuInfo{socketId=0, coreId=9, threadId=0}
11: CpuInfo{socketId=1, coreId=9, threadId=0}
12: CpuInfo{socketId=0, coreId=0, threadId=1}
13: CpuInfo{socketId=1, coreId=0, threadId=1}
14: CpuInfo{socketId=0, coreId=8, threadId=1}
15: CpuInfo{socketId=1, coreId=8, threadId=1}
16: CpuInfo{socketId=0, coreId=2, threadId=1}
17: CpuInfo{socketId=1, coreId=2, threadId=1}
18: CpuInfo{socketId=0, coreId=10, threadId=1}
19: CpuInfo{socketId=1, coreId=10, threadId=1}
20: CpuInfo{socketId=0, coreId=1, threadId=1}
21: CpuInfo{socketId=1, coreId=1, threadId=1}
22: CpuInfo{socketId=0, coreId=9, threadId=1}
23: CpuInfo{socketId=1, coreId=9, threadId=1}
In the above we can see:
- The machine has 24 logical cpus (0 through 23)
- There are two processor sockets (socketId=0 and socketId=1)
- There are 12 physical cores in total, 6 physical cores per socket (coreIds 0, 1, 2, 8, 9 and 10)
- Hyperthreading is enabled and there are two threads per core (threadId=0 and threadId=1)
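For hosts where the platform utility is not handy, the same socketId/coreId information can also be read directly from Linux sysfs. The snippet below is a purely illustrative, standalone sketch (it is not the UtlThread implementation, and it prints socketId/coreId only):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Purely illustrative sketch: reads the Linux sysfs topology files to print the
// socketId/coreId of each logical cpu. This is NOT the platform's UtlThread utility.
public class PrintCpuLayout {
    public static void main(String[] args) throws Exception {
        List<Path> cpus = new ArrayList<>();
        try (DirectoryStream<Path> dir =
                 Files.newDirectoryStream(Paths.get("/sys/devices/system/cpu"), "cpu[0-9]*")) {
            for (Path p : dir) {
                cpus.add(p);
            }
        }
        // Sort numerically by logical cpu id (cpu0, cpu1, ...).
        cpus.sort(Comparator.comparingInt((Path p) ->
                Integer.parseInt(p.getFileName().toString().substring(3))));
        for (Path cpu : cpus) {
            String logicalId = cpu.getFileName().toString().substring(3);
            String socketId = Files.readString(cpu.resolve("topology/physical_package_id")).trim();
            String coreId = Files.readString(cpu.resolve("topology/core_id")).trim();
            // Hyperthread siblings appear as two logical cpus with the same socketId/coreId.
            System.out.println(logicalId + ": socketId=" + socketId + ", coreId=" + coreId);
        }
    }
}
```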
Configuring CPU Affinities
CPU Affinity Mask Format
Thread affinities are configured by supplying a mask that indicates the cores on which a thread can run. The mask can be either a long bit mask of logical cpus or a square-bracket-enclosed, comma-separated list enumerating the logical cpus to which a thread should be affinitized. The latter format is recommended as it is easier to read.
Examples:
- "0" no affinity specified (0x0000)
- "[]" no affinity specified
- "1" specifies logical cpu 0 (0x0001)
- "[0]" specifies logical cpu 0
- "4" specifies logical cpu 2 (0x0100)
- "[2]" list specifying logical cpu 2
- "6" mask specifying logical cpu 1 and 2 (0x0110)
- "4294967296" specifies logical cpu 32 (0x1000 0000 0000 0000 0000 0000 0000 0000)
- "[32]" specifies logical cpu 32
- "[1,2]" list specifying logical cpu 1 and 2
Enabling Affinitization
By default, cpu affinitization is disabled. To enable it, set the appropriate env flags in the DDL configuration.
Affinitizing Critical Threads
Affinitizing Non Platform Threads
If your application uses its own threads, they can be affinitized as well using the platform's UtlThread utility class. Non-critical threads that do not busy spin should be affinitized to the default cores, and critical or busy threads should be pinned to their own core, as sketched below.
Non Critical, Non Spinning Thread
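A sketch of what a non-critical, non-spinning thread might look like. This assumes UtlThread exposes a setCPUAffinityMask(String) style method, and the "[0,12]" default core list is hypothetical; neither is confirmed by this excerpt, so consult the UtlThread javadoc for the actual API:

```java
// Sketch only: the UtlThread.setCPUAffinityMask(String) call and the "[0,12]" default
// core list are assumptions for illustration; consult the UtlThread javadoc.
Runnable backgroundTask = () -> {
    UtlThread.setCPUAffinityMask("[0,12]"); // allow only the default, non-critical cores
    // ... normal work loop; this thread may sleep or block, so no dedicated core is needed ...
};
new Thread(backgroundTask, "background-worker").start();
```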
Critical or Busy Thread
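And a corresponding sketch for a critical, busy-spinning thread (again assuming a setCPUAffinityMask(String) style method; the chosen core is illustrative):

```java
// Sketch only: UtlThread.setCPUAffinityMask(String) is assumed; logical cpu 2 is illustrative.
Runnable criticalLoop = () -> {
    UtlThread.setCPUAffinityMask("[2]"); // pin to a dedicated core
    while (!Thread.currentThread().isInterrupted()) {
        // poll for work here; spin rather than block so the core (and its caches) stay hot
        Thread.onSpinWait();
    }
};
new Thread(criticalLoop, "critical-worker").start();
```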
Best Practices
- Before launching your process, validate that there aren't other processes running that are spinning on a core to which you are affinitizing.
- Check what other processes on the host will use busy spinning and find out the cores they will use.
- In Linux, the OS often uses core 0 for some of its tasks, so it is better to avoid this core if possible.
Configuring NUMA Affinitization
To affinitize the XVM process to a particular socket, you can launch the XVM using numactl. For example, if you affinitize threads to cores on socket 0, it is best to allocate memory only from numa node 0:
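The XVM launch command itself is omitted here; assuming the process is started from a shell, a typical invocation binds both the cpus and the memory of the process to node 0:

```
numactl --cpunodebind=0 --membind=0 <command that launches the XVM>
```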
Validating Affinitization
On Linux, invoke the `top` command and press 1 to verify that no cores outside of the masks you have configured are spinning. Watch out for the following pitfalls:
- Having two threads spinning on the same coreId will make performance worse (whether on the same coreId with different threadIds, or worse, on the same coreId/threadId).
- Having some other non-X process spinning on one of the coreIds to which you have affinitized.
- Affinitizing across multiple socketIds (which are on different numa nodes) can make performance worse.
- Your max heap will be limited to the amount of physical memory in the bank of the NUMA node to which you are pinning.
Limitations
The following limitations apply to thread affinitization support:
- Thread affinitization is currently only supported on Linux.
- Affinitization is limited to logical cpus 0 through 63.
- Affinitizing a thread does not reserve the cpu core; it just limits the cores on which the thread will execute. This is important because, if not all threads are affinitized, the os thread scheduler may schedule another thread on top of a critical thread when cpu resources are scarce.