The Talon Manual



Overview

Talon's threading model is key to its ability to achieve extreme performance levels. This section discusses the key concepts related to threading and high-performance computing, and describes the configuration options for getting the most performance out of Talon. Before jumping into this section, be sure to read through The Talon Application Flow. The key architectural pieces to grasp when thinking about threading in Talon micro apps are: 

  • Single thread writes to application state. 
  • Pipeline execution of application transactions. 

This section discusses these concepts and contains a listing of the platform threads in play in Talon applications. 

The Single Writer Principle

Key to performance in Talon is the mandate that write access to application state is single threaded. The single writer principle posits that "when trying to build a highly scalable system the single biggest limitation on scalability is having multiple writers contend for any item of data or resource". It is the main motivator behind architectural patterns such as the actor model and micro-services. One of the major performance advantages of a micro-app architecture is that by making all state private to the application, it reduces write contention. By bringing all application state into memory, Talon further reduces the cost of updating data by keeping it as close as possible to the business logic operating on it. But even with all state in main memory, there are significant costs at the processor cache level when multiple threads operate on the same piece of data. 

Every Talon application is backed by an AepEngine with a single input multiplexer thread that consumes events and messages coming in from message buses and dispatches them to the application on a single thread that serves as the single writer for an application's state. Application authors thus do not need to concern themselves with synchronization or locking. 
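
To make the pattern concrete, here is a minimal, hypothetical sketch of a single-writer design in plain Java. This is not Talon's actual API; the queue, state, and handler names are invented for illustration. All mutations of application state funnel through one dispatch thread that drains an input queue, so handlers never need locks:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical sketch of the single-writer pattern: one thread owns all state.
    public class SingleWriterSketch {
        private final Map<String, Long> positions = new HashMap<>(); // app state: no locks needed
        private final BlockingQueue<Runnable> inputQueue = new ArrayBlockingQueue<>(1024);

        // Any thread may enqueue work...
        public void offer(Runnable event) {
            inputQueue.offer(event);
        }

        // ...but only the thread running this loop ever touches application state.
        public void runEventLoop() {
            try {
                while (true) {
                    inputQueue.take().run(); // handlers mutate 'positions' safely, single threaded
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // shut down
            }
        }

        public static void main(String[] args) {
            SingleWriterSketch app = new SingleWriterSketch();
            new Thread(app::runEventLoop, "input-multiplexer").start();
            app.offer(() -> app.positions.merge("ACME", 100L, Long::sum));
        }
    }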

As with most architectures, horizontal scalability in Talon can be achieved by partitioning state across instances (whether in the same JVM, on the same machine, or across multiple machines), but it is usually desirable to reduce the number of shards or eliminate the need for sharding altogether in order to reduce cost and complexity. A single writer architecture is a key component of reducing hardware inefficiency as it avoids wasting the processor and memory resources associated with managing inter-thread contention. 

Understanding Detached Threads

As discussed above, the Talon application programming model is single threaded. It is therefore desirable to keep the application's single business logic thread busy performing application logic as much as possible, rather than spending cycles on infrastructural concerns such as replication or persistence. To that end, the platform provides the ability to do much of this non-functional heavy lifting in background threads that are detached from the business logic. 

Work that can be configured to be done in detached threads includes:

  • Replication (Detached Store Sender, Detached Store Dispatcher)
  • Persistence (Detached Persister)
  • Intercluster Replication (Detached ICR Sender)
  • Message Logging (Detached Inbound / Outbound Message Loggers) 

The Listing of Threads section below describes these threads in more detail. For optimal latency and throughput, these threads can also be affinitized to particular CPU cores to reduce the jitter and throughput loss caused by thread context switching (see Tuning Thread Affinitization and NUMA).
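
As a generic illustration of the detached pattern (a hypothetical sketch in plain Java, not Talon's internals): the business thread offers records to a bounded in-memory buffer and returns immediately, while a detached thread drains the buffer and absorbs the disk I/O:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical sketch of a detached logger: business thread hands off disk writes.
    public class DetachedLoggerSketch {
        private final BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(4096);

        // Called from the business thread: cheap, no disk I/O on the critical path.
        public void log(String record) {
            buffer.offer(record.getBytes(StandardCharsets.UTF_8)); // drops on overflow in this sketch
        }

        // Runs on the detached thread: performs the slow writes, buffering I/O spikes.
        public void drainTo(OutputStream out) {
            try {
                while (true) {
                    out.write(buffer.take());
                }
            } catch (InterruptedException | IOException e) {
                Thread.currentThread().interrupt(); // shut down on interrupt or write failure
            }
        }

        public static void main(String[] args) throws IOException {
            DetachedLoggerSketch logger = new DetachedLoggerSketch();
            OutputStream out = new FileOutputStream("messages.log");
            new Thread(() -> logger.drainTo(out), "detached-logger").start();
            logger.log("order accepted"); // returns immediately
        }
    }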

Understanding Disruptors

Effective pipelining between threads is key to Talon's performance, meaning that inter-thread communication must be optimal. Talon uses LMAX Disruptors to pass data between critical threads in the processing pipeline. Throughout X configuration you will see settings for configuring disruptors that look like the following:

queueDepth

Controls the size of the ring buffer. It is best to choose a power of 2 for the ring buffer size. The buffer should be sized large enough to absorb spikes in application traffic without blocking the offering thread, but otherwise should generally be kept small enough that the amount of active data in the pipeline avoids taxing CPU caches.

The default size for most disruptors is 1024.

queueWaitStrategy

Controls how the thread draining events from the ring buffer waits for more events. One of BusySpin|Yielding|Sleeping|Blocking.

For applications that want the lowest possible latency, BusySpin causes the draining thread to spin without signaling to the OS that it should be context switched, which avoids jitter. This policy is most appropriate when the number of cores available in the machine is adequate for each reader to occupy its own core. Otherwise, a Yielding wait strategy can be used. Both BusySpin and Yielding are CPU intensive and are most appropriate for performance-critical applications that run on hardware dedicated to the application.

queueDrainerCpuAffinityMask

Controls the CPU to which the draining thread is affinitized. For BusySpin or Yielding policies, affinitizing threads can further reduce jitter (see Tuning Thread Affinitization and NUMA).

queueOfferStrategy

Overrides the offer strategy used to manage concurrency when offering elements to the ring buffer.

(warning) In general, applications should not change this property as the platform will choose a sensible default.
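
For context, these knobs correspond to concepts in the open-source LMAX Disruptor API. The sketch below is plain Disruptor usage, not Talon configuration: the ring buffer size plays the role of queueDepth, and the wait strategy passed to the constructor plays the role of queueWaitStrategy:

    import com.lmax.disruptor.BusySpinWaitStrategy;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class DisruptorKnobsSketch {
        static final class Event { long payload; }

        public static void main(String[] args) {
            // Ring buffer size: a power of 2 (1024 matches the typical default above).
            // Wait strategy: BusySpin here; Yielding/Sleeping/Blocking are the alternatives.
            Disruptor<Event> disruptor = new Disruptor<>(
                    Event::new, 1024, DaemonThreadFactory.INSTANCE,
                    ProducerType.SINGLE, new BusySpinWaitStrategy());

            // The draining thread (the one a CPU affinity mask would pin) runs this handler.
            disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                    System.out.println("processed " + event.payload));
            RingBuffer<Event> ring = disruptor.start();

            // Offering side: claim a slot, fill it, publish.
            long seq = ring.next();
            try {
                ring.get(seq).payload = 42;
            } finally {
                ring.publish(seq);
            }
        }
    }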

Auto Tuning of Disruptor Wait Strategies

When the nv.optimizefor environment property is set to latency or throughput, disruptors in the critical path are automatically set to BusySpin or Yielding respectively, unless explicitly configured otherwise.
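
A minimal sketch of setting the environment property, assuming nv.optimizefor can be supplied as a regular JVM system property (an assumption; your deployment may set environment properties elsewhere, such as in configuration):

    public class LatencyProfileLauncher {
        public static void main(String[] args) {
            // Assumption: equivalent to launching the JVM with -Dnv.optimizefor=latency.
            System.setProperty("nv.optimizefor", "latency");
            // ... initialize and start the Talon application after this point ...
        }
    }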


Listing of Threads

Talon Application Threads

Each entry below lists the thread, its name pattern, whether it is on the critical path, and a description.

Thread: AEP Engine Input Multiplexer
Name: X-STEMux-<appName>-<instanceid>
Critical Path: Yes

The engine thread that dequeues and dispatches application messages and events. This is the main application thread on which application events are dispatched. Thread names are suffixed with a global counter to allow differentiating between stats emitted by multiple instances of the same application running in the same JVM.

The detached threads outlined below can offload work from this thread, which can improve throughput and latencies in your application.

(lightbulb) Note that when running in an XVM you will see AEP engine threads created for the XVM's admin application, which is used to handle XVM admin requests.

Name: X-EventMultiplexer-Wakeup-<appName>
Critical Path: No

A timer thread used to wake up and dispatch events scheduled via the engine's input queue.
Thread: Detached Inbound Message Logger
Name: X-ODS-StoreLog-<appName>.in
Critical Path: No

When the application is configured with a detached inbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes.

As inbound message loggers aren't used for HA purposes, they are not on the critical path.

(lightbulb) If your application's inbound message load is not high, a detached inbound message logger may not be needed. The tleg3 transaction latency statistic covers inbound message logging. High values or spikes are an indicator that a detached inbound message logger can help.
Thread: Detached Outbound Message Logger
Name: X-ODS-StoreLog-<appName>.out
Critical Path: No

When the application is configured with a detached outbound message logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes.

(lightbulb) If your application's outbound message load is not high, a detached outbound message logger may not be needed. The tleg3 transaction latency statistic covers outbound message logging. High values or spikes are an indicator that a detached outbound message logger can help.
Thread: Per Transaction Stats Logger
Name: X-ODS-StoreLog-<appName>.txnstats
Critical Path: No

When the application is configured with a detached per-transaction stats logger, this thread offloads the work of writing to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes.

(lightbulb) If your application's transaction load is not high, a detached per-transaction stats logger may not be needed. The cepilo (commit epilogue) transaction latency statistic covers per-transaction stats logging costs. High values or spikes in cepilo are an indicator that a detached per-transaction stats logger can help.

Bus Threads

Each bus configured for your application is managed by a Bus Manager internal to the AEP engine.

In addition to application-configured buses, an engine creates several internal buses which are not on the critical processing path:

  • control-<uuid>: used for attaching internal AEP control events to transactions (for example, fault-tolerant timer related events).
  • client: used as a bus through which admin clients can connect to the application via an XVM rather than over a messaging bus.

Bus binding instances will generally create additional threads specific to the binding type.

Thread: Detached Bus Send Thread
Name: X-AEP-BusManager-IO-<appName>.<busName>
Critical Path: Yes

When the bus is configured for detached send, this thread offloads the work of serializing and writing outbound messages from the engine's input multiplexer, which serves as a buffer against spikes caused by message bus flow control.

(lightbulb) High values in the 'o2p', 's', 's2w', or 'ws' message bus binding stats are indicators that a detached bus sender can improve performance.

Thread: Bus Binding Opener
Name: X-AEP-BusManager-BindingOpener-<appName>.<busName>
Critical Path: No

Handles establishment of bus connections and reconnections.

Store Threads

Each store instance will minimally create a replication reader thread which handles reading from peers.

Thread: Store Reader Thread
Name: X-ODS-StoreReplicatorLinkReader-<storeName>-<memberName>
Critical Path: Yes

The IO thread for the store, which is used to read replication traffic from cluster peers.

Thread: Detached Store Persister
Name: X-ODS-StoreLog-<storeName>-<instanceid>
Critical Path: No

When the store is configured for detached persistence, this thread offloads the work of writing recovery data to disk from the engine's input multiplexer, which can serve as a buffer against disk I/O spikes.

(lightbulb) High values in the 'pers' (persistence) store binding stats are an indicator that a detached store persister can improve performance.

Thread: Detached ICR Sender
Name: X-ODS-Store-ICR-Sender-<storeName>-<instanceid>
Critical Path: No

When the store is configured for detached inter-cluster replication, this thread offloads the work of writing recovery data to the receiver from the engine's input multiplexer, which can serve as a buffer against I/O spikes.

(lightbulb) High values in the 'icr' (inter-cluster replication) store binding stats are an indicator that a detached ICR sender can improve performance.
Thread: Detached Store Send Thread
Name: X-ODS-StoreReplicatorSender-<storeName>-<memberName>
Critical Path: Yes

When the store is configured for detached send, this thread offloads the work of writing recovery data to the network for backup instances from the engine's input multiplexer, which can serve as a buffer against network I/O spikes.

(lightbulb) High values in the 's2w' (serialize to wire) store binding stats are an indicator that a detached store sender can improve performance.

Thread: Detached Store Dispatch Thread
Name: X-ODS-StoreReplicatorDispatcher-<storeName>-<memberName>
Critical Path: Yes

When the store is configured for detached dispatch, this thread allows the store reader thread to offload the work of dispatching deserialized replication traffic to the engine for processing. This is useful in cases where the cost of deserializing replication traffic is high.

(lightbulb) A high value for the store binding deserialize stat ('d') can indicate that setting this property could improve throughput or latency.

Thread: Store Acceptor Thread
Name: X-ODS-StoreLinkAcceptor-<instanceid>
Critical Path: No

Each store configured for clustering will create a thread that listens for connection requests from other store members. Once a connection is established, it is handed off to the store reader thread for processing.
Miscellaneous

Thread: Stats Printer Threads
Name: X-Stats-Printer [<statName>-<instanceid>.stats]
Critical Path: No

Several components can be configured to trace stats independently of XVM-collected stats. When such stats threads are enabled, a thread is created to periodically print stats. This is typically useful if an app is run outside of an XVM.

Thread: Scheduler
Name: X-Scheduler-<instance-count>
Critical Path: No

A timer thread used for scheduling events. An AepEngine uses this to perform periodic engine health checks, for example.

Discovery Threads

Thread: Discovery Timer
Name: X-EDP-Timer
Critical Path: No

Each discovery provider that is opened will create a timer thread that will periodically wake up to perform discovery broadcasts.

Each discovery provider typically will create additional threads specific to the discovery provider type. For example, when using an SMA-based discovery provider, message bus binding threads will be created.

XVM Threads

Thread: XVM main thread
Name: X-Server-<xvmName>-Main
Critical Path: No

This thread creates and starts applications at startup and, once startup completes, drives the server acceptors that accept admin and direct connections to the XVM.

Thread: XVM stats collector
Name: X-Server-<xvmName>-StatsRunner
Critical Path: No

Collects stats for applications and populates them into heartbeats that can be traced, logged, dispatched, and emitted.

Thread: XVM dedicated IO thread
Name: X-Server-<xvmName>-IOThread-<threadNumber>
Critical Path: Yes

When the server is configured for multithreading, additional IO threads beyond the XVM main thread are created to service the connections affinitized to them.

(lightbulb) When using the direct binding, the IO thread is on the critical path for received messages.
