The Talon Manual

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Perform higher level statistical computations such as calculating message rates and average latencies and output its results to a named trace logger using the platform's tracing and logging system.
  2. Detect violation/abatement of alert thresholds and dispatch appropriate alert events to the application.

The raw metrics are collected by the engine are used by the background statistical thread for its computations and can also be retreived programatically retrieved programmatically by an application for its own use.
In this document, we describe:

  • Describe the various raw metrics captured by an engine,
  • Describe how to switch on and configure the statistics thread operation,
  • Describe the higher level statistics calculations performed by the statistics thread,
  • Describe and the format of the output of the statistics thread.

...

As collection of AEPEngine involves overhead, the engine provides somewhat granular controls on the types of stats that are collected to allow applications to balance the cost of collecting statistics. The table below summarizes the various stats collection that can be enabled.

Configuration SettingDefaultDescription
nv.aep.<engine>.latency.statsfalseIndicates whether or not latency stats should for the engine should be collected. The latency stats include transaction level latencies that can be used in monitoring the overal overall latencies of the various legs of a transaction.
nv.stats.series.samplesize10240

Property that can be used to control the default sampling size for series stats.

If the number of datapoints collected in a stats interval exceeds this size, the computation for histographical data will be lossy, increasing . Increasing the value will reduce reduces loss of datapoints, but results in greater overhead in stats collection in terms of both memory usage and pressure on the process caches.

nv.aep.<engine>.msgtype.statsfalseIndicates whether or not per-message-type statistics should be collected. When enabled, processing latencies are recorded on a per message basis. Per message type stats introduce a fair amount of overhead, so typically enabling their collection is enabled only during application profiling for application with stringent performance requirements.
nv.aep.<engine>.event.latency.statsfalseIndicates whether or not per-event-type latency statistics should be included in a given engine's stats trace output. Event latency statistics can be used to capture the processing latencies for each type of event being processed in the engine. This is useful in determining if particular event types are consuming considerable engine thread processing time. Per event type stats introduce a fair amount of overhead, so typically enabling their collection is enabled only during application profiling for application with stringent performance requirements.
nv.msg.latency.statsfalse

Property that enabled message latency stats tracking.

When set to true, timings for messages are captured as they flow through the system. Enablement of these stats is required to collect message bus latency stats. Enabling this property can increase latency due to the overhead of tracking timestamps.

nv.msgtype.latency.statsfalse

Property that enabled per message type latency stats tracking.

When set to true, timings for each message type are individually tracked as separate stats. This can be useful in tracking down issues in which a particular message type is problematic (for example, tracking down a high application handler message processing time). However, it results in a higher overhead.

nv.ods.latency.statsfalseIndicates whether or not store latency statistics are captured. Store latency stats expose statistics related to serializing replicating and persisting transactions.
nv.event.latency.statsfalseIndicates whether or not event latency statistics are captured. Enabling Event latency stats record timestamps for enqueue and dequeue of events across event multiplexer, such as the AepEngine's input multiplexer queue. Enabling event latency stats is useful for determining if an engines engine's event multiplexer queue is backing up by recording the time that events remain on the input queue.
nv.link.network.stampiotsfalse

Indicates whether or not timestamps should be stamped on inbound and outbound messages. Disabled by default.

Enabling this setting will allow engines to provide more detail in the form of transaction legs in message latency statistics.

...

The following output threads can be enabled to trace statistics, which is useful for testing and performance tuning. Enabling these output threads is not required for collecting stats. Statistics trace output is not zero garbage, so in a production scenario it usually makes more sense to the collect stats via Server Statistics, which emits zero garbage hearbeats heartbeats with the above statistics. 

Configuration SettingDefaultDescription
nv.aep.<engine>.stats.interval0

The interval (in seconds) at which engine stats will be traced for a given engine.

Can be set to a postive positive integer indicate the period in seconds at which the engine's stats dump thread will dump recorded engine statistics. Setting a value of 0 disables creation of the stats thread.

When enabled, engine stats are traced to the logger 'nv.aep.engine.stats' at a level of Tracer.Level.INFO; therefore, so to see dumped stats, a trace level of 'nv.aep.engine.stats.trace=info' must be enabled.

NOTE:: disabling the engine stats thread only stops stats from being periodically traced. It does not stop the engine from collecting stats ... ; stats can still be collected by an external thread (such as the Talon Server which reports the stats in server heartbeats). In other words, enabling the stats thread is not a prerequisite for collecting stats, and disabling the stats reporting thread does not stop them from being collected.

NOTE: that while  while collection of engine stats is a zero garbage operation, tracing engine stats is not a zero garbage when performed by this stats thread. For latency sensitive apps, it is recommended to run in a Talon server which can collect engine stats and report them in heartbeats in a zero garbage fashion.

nv.aep.<engine>.sysstats.interval0

The interval (in seconds) at which engine sys stats will be reported. Set to 0 (the default) to completely disable sys stats tracing for a given engine.

In most cases, AEP sys stats will not be used and system level stats would be recorded in the Server Statistics] from which an AEPEngine is running.

nv.event.mux.<name>.stats.interval0

The interval (in seconds) at which multiplexer stats will be traced.

Multiplexer stats can also be reported as part of the overall engine stats from the engine stats thread, so there is no need to set this to a non-zero value if nv.aep.<engine>.stats.interval is greater than zero.

nv.msg.latency.stats.interval0

The interval (in seconds) at which message latency stats are traced.

 

This setting has no effect if nv.msg.latency.stats is false. This allows granular tracing of just message latency stats on a per bus basis. Message latency stats can also be reported as part of the overall engine stats from the engine stats thread, so there is no need to set this to a non-zero value if nv.aep.<engine>.stats.interval is greater than zero.

nv.aep.busmanager.<engine>.<bus>.stats.interval0The interval (in seconds) at which bus stats will be traced. Bus stats reported as part of the overall engine stats from the engine stats thread, so there is no need to set this to a non-zero value if nv.aep.<engine>.stats.interval is greater than zero. When engine stats output is disabled this can be used to trace only bus stats for a particular message bus.

...

Metric

Description

NumFlows

Total number of message flows functioning in the engine.

NumMsgsRcvdBestEffort

Total number of messages received by the engine on best-effort channels.

NumMsgsRcvdGuaranteed

Total number of messages received by the engine on guaranteed channels.

NumMsgsSourced

[EventSourcing Only]
Total number of messages sourced from from the recovery log or primary agent in a cluster during agent inialization.

NumMsgsFilteredThe number of messages that were filtered.

NumDupMsgsRcvd

Total number of duplicate messages received and discarded by an engine. This metric will always be 0 if duplicate detection has been disabled via the nv.aep.duplicate.checking configuration property.

NumMsgsSentBestEffort

Total number of messages sent by the engine on best-effort channels.

NumMsgsSentGuaranteed

Total number of messages sent by the engine on guaranteed channels.

NumMsgsResent

Total number of messages retransmitted by an engine.

When a backup agent assumes the role of a primary on failover, it restransmits all in-doubt messages as part of its first transaction. In-doubt messages are those messages for which positive acknowledgements have not been received from the downstream messaging components. This metric records the number of such restransmitted messages.

NumEventsRcvd

Total number of events received by an engine.

Events received by an engine include message events and the non-message events, e.g. 'channel up event', 'channel down event', etc.

NumFlowEventsRcvd

Total number of flow events received by the engine.

Flow events are synonymous with message events. These include message events received on message channels and message events injected for processing into the engine by the application.

NumFlowEventsProcSuccess

Total number of successfully processed flow events.

A flow event is considered successfully processed if the application did not throw an exception while processing the message associated with the event.

NumFlowEventsProcFail

Total number of failed flow events.

A flow event is considered to have failed processing if the application threw an unchecked exception while procesing the message associated with the event.

NumFlowEventsProcComplete

Total number of flow events whose transactions have completed.

Each successfully processed flow event participates in an AEP transaction. This metric counts the number of successful flow events whose transactions have completed.

NumTransactions

Total number of transactions that have been committed or rolled back.

Note: This metric is equal to NumCommitsStarted + NumRollbacks

NumCommitsStarted

Total number of transactions whose commits have been started.

Note: Transactions are committed in a pipelined manner. Therefore, there can be multiple transactions in the commit pipeline. This metric counts the number of transactions that have been entered into the commit pipeline.

See NumCommitsCompeted

NumCommitsCompleted

Total number of transactions whose commits have completed.

This metric counts the number of transactions that have completed and exited the transaction commit pipeline. The difference between this and the NumCommitsStarted metric is the number of transactions in flight at any point in time.

See NumCommitsStarted

NumSendCommitsStarted

Total number of transactions whose send commits have been started.

A transaction commit pipeline is comprised of a send commit pipeline (the commit of outbound messages in the transaction) and a store commit pipeline (commit to store changes). This metric counts the number of transactions in the send commit pipeline.

NumSendCommitsCompleted

Total number of transactions whose send commits have completed.

This metric counts the number of transactions whose send commits have completed and exited the send commit pipeline.

SendCommitCompletionQueueSize

Number of transaction in the send commit completion queue.

Transactions that complete the send portion of the commit are passed through a send commit completion queue for sequencing before entering into the next phase of a transaction commit. This metric counts the number of transactions held in the post send commit sequencing queue at any point in time.

NumStoreCommitsStarted

Total number of transactions whose store commits have been started.

A transaction commit pipeline is comprised of a send commit pipeline (the commit of outbound messages in the transaction) and a store commit pipeline (commit to store changes). This metric counts the number of transactions in the store commit pipeline.

NumStoreCommitsCompleted

Total number of transactions whose store commits have completed.

This metric counts the number of transactions whose store commits have completed and exited the store commit pipeline.

StoreCommitCompletionQueueSize

Number of transaction in the store commit completion queue.

Transactions that complete the store portion of the commit are passed through a store commit completion queue for sequencing before entering into the next phase of a transaction commit. This metric counts the number of transactions held in the post store commit sequencing queue at any point in time.

NumRollbacks

Number of transactions that have been rolled back.

BackupOutboundQueueSize

[EventSourcing Only]
Number of messages in a backup's outbound queue.

A backup outbound queue holds outbound messages sent during concurrent processing by a backup agent in an Event Sourced cluster. The messages are held in the queue as in-doubt messages until notifications are received from the primary 'acknowledging' the receipt of the messages from downstream messaging components/agents. The messages are flushed from the queue upon receipt of such notifications. This metric counts the number of messages in a backup's outbound queue.

Note: The notifications are piggy-backed on other replication traffic to avoid extra network traffic. Therefore, in the absence of replication traffic, messages can remain in the backup's outbound queue even though acknowledged by the downstream messaging system until some replication traffic occurs to replicate the downstream ack notification.

BackupOutboundLogQueueSize

[EventSourcing Only]
Number of messages in a backup's outbound log queue.

A backup's outbound log queue holds outbound messages for the course of a single transaction. The queue is flushed to the outbound message log upon completion of the transaction. This metric holds the number of messages in the outbound log queue.

OutboundSno

The current outbound sequence number in use by the engine.

Each engine instance maintains a sequence number space for messages outbound by the engine. This metric holds the current outbound sequence number.

OutboundStableSno

The current 'stable' outbound sequence number.

An outbound is considered to be stable when a positive acknowledgement is received for the message from the downstream messaging system. This metric holds the sequence number of the last stable outbound message.

Message Type Specific Statistics

Message type specific statistics.

The engine also (optionally driven by configuration) maintains statistics for each of the different message types flowing through the engine. The following are the different metrics collected for each message type:

  • NumMsgsRcvdBestEffort
  • NumMsgsRcvdGuaranteed
  • NumMsgsSourced
  • NumDupMsgsRcvd
  • NumMsgsSentBestEffort
  • NumMsgsSentGuaranteed
  • NumMsgsResent

    The semantic meaning of these metrics for each message type is identical to metrics with the same name described above, except that it is local to the message type.

...

The AEP engine's statistics thread performs the following operations:

  1. Compute Computes higher level statistics such as metric averages.
  2. Output Outputs raw and computed statistics to a trace logger.
  3. Trigger Triggers alerts based on conifigured configured alert thresholds.

The statistics thread can be started/stopped either programatically programmatically or via configuration parameters.

...

By default, an AEP engine does not collect message type speciifc specifc stats. The following configuration switches on collection of stats for message types:
nv.aep.<engineName>.msgtype.stats=true
For example, the following would switch on message type specific stats for the 'forwarder' engine.
nv.aep.forwarder.msgtype.stats=true

...

PhaseDescription
mpprocRecords the time (in microseconds) spent by the engine dispatching the message to an application.
mprocRecords latencies for application message process times (in an EventHandler).
mfiltRecords latencies for application message filtering times (by a message filter).
msend

Time spent in AepEngine.sendMessage().

The time in the AepEngine's send call. This latency will be a subset of mproc for solicited sends and it includes msendc.

msendc

Time spent in the AepEngine's core send logic.

This leg includes enqueuing the message for delivery in the corresponding bus manager.

cstartTime spent from the point the first message of a transaction is received to the time the transaction is committed.
cprolo

Time spent from the point where transaction commit is started to send or store commit, whichever is occurs first.

This latency measures the time taken in any bookkeeping done by the engine prior to commit the transaction to store (or for an engine without a store until outbound messages are released for delivery).

csend

The send commit latency: i.e. time from when send commit is initiated, to receipt of send completion event.

This latency represents the time from when outbound messages for a transaction are released to the time that all acknowledgements for the messages are received.

Because this latency includes acknowledgement time a high value for csend does not necessarily indicate that downstream latency will be affected. The Message Latencies listed below allow this value to be decomposied further.

ctrans

Time spent from the point the store commit completes to the beginning of the send commit which releases a transaction's outbound messages for delivery.

If the engine doesn't have a store, then this statistic is not captured as messages are released immediately.

cstore

The store commit latency i.e. time from when store commit is initiated to receipt of store completion event.

This latency includes the time spent serializing transaction contents, persisting to the store's transaction log, inter cluster replication, and replication to backup members including the replication ack.

(lightbulb) High values in cstore will impact downstream message latencies because store commit must complete before outbound messages are released for delivery. The cstore latency is further broken down in the Store Latencies listed below.

cepiloTime spent from the point the store or the send commit completes, whichever is last, to commit completion.
cfullTime spent from the time the first message of a transaction is received to commit completion.
tleg1

Records latencies for the first transaction processing leg.

Transaction Leg One includes time spent from the point where the first message of a transaction is received to submission of send/store commit. It includes message processing and and any overhead associated with transactional book keeping done by the engine.

(lightbulb) Each transaction leg is a portion of the overall commit time that is processed on the Aep Engine's thread. The sum of the transaction leg stats are important in that they determine the overall throughput that an application can sustain in terms of transactions per second.

tleg2

Records latencies for the second transaction processing leg.

Transaction Leg Two includes time spent from the point where the send/store commit completion is received to the submission of store/send commit.

(lightbulb) Each transaction leg is a portion of the overall commit time that is processed on the Aep Engine's thread. The sum of the transaction leg stats are important in that they determine the overall throughput that an application can sustain in terms of transactions per second.

tleg3

Records latencies for the third transaction processing leg.

Transaction Leg Three includes time spent from the point where the store/store commit completion is received to the completion of the transaction commit.

(lightbulb) Each transaction leg is a portion of the overall commit time that is processed on the Aep Engine's thread. The sum of the transaction leg stats are important in that they determine the overall throughput that an application can sustain in terms of transactions per second.

inout

Records latencies for receipt of a message to transmission of the last outbound message.

inackRecords latencies for receipt of a message to stabilization (and upstream acknowledgement for Guaranteed).

...

The engine stats thread can also trace summary statistics for its message buses. Each message bus is wrapped by a Bus Manager which handles bus connect and reconnect, and also provides transactional semantics around the bus by queuing up messages that will be sent as part of an engine transaction. Each Bus Manager maintains statistics for across bus binding reconnects, allowing continuous stats across bus binding reconnects. The following sections break these statistics down in more detail.

...

FieldDescription
NumMsgsRcvdThe number of message received by the bus.

NumMsgsInBatches

The number of message received by the bus that were part of a batch.
NumMsgBatchesRcvd

The number of batch message received by the bus.

NumPacketsRcvdThe number of raw packets received by the bus.
NumMsgsEnqueued

The total number of batch messages enqueued for delivery by this bus.

NumAcksSentThe total number of acknowledgement sent upstream for received messages by this bus.
NumStabilityRcvdThe number of stability events (acks) received by this bus.
NumStabilityBatchesRcvdThe number of batched stability events received by this bus.
NumMsgsEnqueued

The total number of batch messages enqueued for delivery by this bus.

NumMsgsSentThe total number of batch messages enqueued message that were actually sent by the bus.
NumFlushesSync

The number of times this bus has been synchronously flushed.

NumFlushesAsyncThe number of times this bus has been asynchronously flushed.
NumMsgsFlushedSyncThe number of messages flushed by synchronous flushes.
NumMsgsFlushedAsync

The number of messages flushed by asynchronous flushes.

NumAsyncFlushCompletionsThe number of asynchronous flushes for this that have completed.
NumCommitsThe number transactions committed by the bus.
NumRollbacks

The number transactions rolled back for the bus.

...

When a Bus Manager is configured for detached send (aka detached commit), a transaction's outbound messages are dispatched on the Bus Manager's I/O thread. The "Offer to Poll" latency measure the time from when messages are released for delivery after being stabilized in the store, until the detached bus manager picks them up for send. High o2p latencies in a Bus Manager may indicate that messages are being released for send faster than we can actually send them.

...

  1. the point at which an event is offered to the multiplexer by a thread and the point at which it is actually enqueued for processing – i.e. the o2p statistics traced per Feeder Queue:.

    Panel

    Feeder Queues (max=16, lastDecongest=0)
    ...X-Server-ems1-Main (aff=[]) has 0 (decongestCount=64500)
    ......[o2p] [sample=0, min=-1 max=-1 mean=-1 median=-1 75%ile=-1 90%ile=-1 99%ile=-1 99.9%ile=-1 99.99%ile=-1]
    ...X-ODS-StoreEventMultiplexer-1 (aff=[3(s0c3t0)]) has 0 (decongestCount=64492)
    ......[o2p] [sample=0, min=-1 max=-1 mean=-1 median=-1 75%ile=-1 90%ile=-1 99%ile=-1 99.9%ile=-1 99.99%ile=-1] 

  2. the time between an event being enqueued for processing and actually being dequeued for processing by the multiplexer – i.e. the o2p statistics traced for the Event Multiplexer's Disruptor:.

    Panel

    [Event Multiplexer (ems)]
    Disruptor (MultiThreadedSufficientCores, BusySpin)..
    [0 of 1,024] 0%
    ...[o2p] [sample=0, min=-1 max=-1 mean=-1 median=-1 75%ile=-1 90%ile=-1 99%ile=-1 99.9%ile=-1 99.99%ile=-1]

...

PhaseDescription
cqsThe number of entries committed per commit.

(warning) While this statistic is reported with latencies, it is not actually a latency statistic but rather captures the number of entries included in a commit. For Event Sourcing this will reflect the number of input event batched into a transaction, and for State Replication, this will reflect the number of state entries and outbound messages in the transaction.
s

The amount of time spent serializing transaction entries in preparation of replication / and persistence.

s2w

The time between serializing transaction entries to the last entry being written to the wire (but not stabilized) for replication.

(lightbulb) A high value here can indicate that there is network pushback from backup members.

s2p

The time between serializing transaction entries to the time that entries have been passed to the persister for write to disk (but not yet synced).

(lightbulb) A high value here can indicate that there is disk pushback or that there is overhead in preparing entries for persistence or ICR.  

w

The commit wire time.

For a store in the primary role, wire latency captures the time from the last commit entry being replicated until the last commit ack is received for the transaction. In other words: the round trip time.

For a store in the backup role wire latency capture the amount of time from when the primary wrote the last commit entry to the wire to the time it was received by the backup. When primary and backup are on different hosts, this statistic is subject to clock skew and could even be negative.

(lightbulb) A high value for 'w' on a Primary member may indicate a slow replication connection or can indicate that there is a bottleneck occurring in a backup member. If s2w is low, the latter case is more likely and backup stats should be consulted. A high value for 'w' is not necessarily a problem for applications that are more concerned with throughput provided there is enough bandwidth, but a high value can impact downstream message latency. For a backup member (discounting clock skew) a high value would indicate latency in the replication connection.

w2dThe time between receipt of a commit packet to the point at which deserialization of the entries has started (by a backup store member).
dThe time spent deserializing a transaction's entries (by a backup store member).
per

The amount of time spent persisting transaction entries to disk.

(lightbulb) A high value here indicates that disk is a bottleneck.

icr

The amount of time spent sending transaction entries to via an ICR Sender.

(lightbulb) A high value here indicates that intercluster replication is a bottleneck.

idxThe index latency records the time spent indexing records during commit.
cThe commit latency records the latency from commit start to the commit being stabilized to follower members and / or to disk.

...

A started engine statistics thread computes the following at the output frequency at which it was configured (programatically programmatically or administratively).

Anchor
_Toc221568558
_Toc221568558
Message Receipt

...