The Talon Manual

SINCE 3.1

Overview

The platform allows users to define their own application-specific stats. These user-defined app stats can be registered with the AepEngine, which allows them to be traced along with Aep Engine Stats and included in server heartbeats. Applications can register stats with the AepEngine programmatically or, when running in a Talon server, have them discovered via annotations.

This article describes the usage of the following types of statistics that applications can expose:

Stat Type: Description

  • Gauge: A Gauge samples and reports a value on each stats collection interval. Gauges can be exposed simply by annotating a field or method of interest.
  • Counter: A Counter stat captures a monotonically increasing value, and can be used to derive rates based on deltas between intervals.
  • Series: A Series stat allows recording of a series of data points upon which histographical computations can be reported.
  • Latencies: A special case of a Series stat used to collect timing-related data points. Latencies are used extensively within Core X to provide visibility into transaction processing pipeline times.

 

Gauges

A Gauge captures an instantaneous value at the time of a statistic collection. Gauges can be of the following types:

  • boolean
  • byte
  • short
  • int
  • long
  • float
  • double
  • char
  • String or XString
 

 

Important considerations regarding gauge collection

Gauge values are collected on one or more stats collection threads, separate from the business logic thread. Consequently:

  • Gauge field values must be declared as volatile to ensure changes to them are visible to collection threads.

Additionally, for method gauges:

  • Be sure that the computation cost is not so high that it skews statistics collection. Consider using a background thread for computing gauge values that are computationally expensive.
  • It is possible that more than one stats collection thread will be collecting and reporting on stats concurrently, so method gauges should be threadsafe.
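The considerations above can be sketched in plain Java. This is a minimal, self-contained illustration rather than platform code: the class and method names (OrderGauges, sampleLastOrderId, refreshBacklog) are hypothetical, and the calling thread merely stands in for a stats collection thread.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder of gauge values; none of these names come from the platform.
final class OrderGauges {
    // Written by the business logic thread, read by a collection thread:
    // volatile makes each write visible to the reader.
    private volatile long lastOrderId;

    // Result of an expensive computation, refreshed off the collection path.
    private final AtomicLong cachedBacklog = new AtomicLong();

    void onOrder(long orderId) {
        lastOrderId = orderId;
    }

    // Cheap, thread-safe accessors that are safe to call from several collectors at once.
    long sampleLastOrderId() {
        return lastOrderId;
    }

    long sampleBacklog() {
        return cachedBacklog.get();
    }

    // Intended to be called from a background thread, not from the collector.
    void refreshBacklog(long[] queueDepths) {
        long total = 0;
        for (long depth : queueDepths) {
            total += depth;
        }
        cachedBacklog.set(total);
    }

    // Runs a writer thread to completion, then samples from the caller's thread.
    static long writeThenSample(long orderId) {
        OrderGauges gauges = new OrderGauges();
        Thread business = new Thread(() -> gauges.onOrder(orderId));
        business.start();
        try {
            business.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return gauges.sampleLastOrderId();
    }
}
```

The expensive value is computed elsewhere and only read during sampling, so the cost of a collection stays constant regardless of how costly the computation is.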

 

Field Gauges

When running in a Talon Server, it is possible to annotate a field as a gauge:
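A minimal sketch of what an annotated field gauge could look like. The AppStat annotation declared here is a local stand-in so that the example compiles on its own; the platform's real annotation ships with the Talon libraries and may carry other attributes. OrderApp and FieldGaugeSampler are hypothetical names, and the reflective sampler only illustrates the general idea of discovering and reading annotated fields, not how the server does it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Local stand-in for the platform's AppStat annotation (assumption: only 'name' is modeled).
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.METHOD})
@interface AppStat {
    String name() default "";
}

// Hypothetical application class exposing a field gauge.
final class OrderApp {
    @AppStat(name = "Last Order ID Processed")
    private volatile long lastOrderId;

    void onOrder(long orderId) {
        lastOrderId = orderId;
    }
}

// Hypothetical helper: reads every annotated field of an object by reflection.
final class FieldGaugeSampler {
    static Map<String, Object> sample(Object app) {
        Map<String, Object> values = new LinkedHashMap<>();
        for (Field field : app.getClass().getDeclaredFields()) {
            AppStat stat = field.getAnnotation(AppStat.class);
            if (stat == null) {
                continue; // not a stat
            }
            field.setAccessible(true);
            try {
                values.put(stat.name(), field.get(app)); // primitives come back boxed
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return values;
    }
}
```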

 

Method Gauges

When running in a Talon Server, it is possible to annotate a method as a gauge accessor:

Method accessor gauges for primitive type are not Zero Garbage. This is because the platform invokes getHasOrderErrors via reflection which generates autoboxing garbage. A better approach, if your application is sensitive to garbage, is to use a Gauge subclass which can directly return the primitive type.
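The garbage concern can be illustrated with plain JDK reflection. In the sketch below, Method.invoke returns Object, so a primitive result is wrapped before the caller sees it, whereas a typed accessor hands back the primitive directly. LongGauge and OrderStats are hypothetical names; LongGauge only stands in for the idea of a typed Gauge subclass and is not the platform's class.

```java
import java.lang.reflect.Method;

// Hypothetical primitive-returning gauge; stands in for a typed Gauge subclass.
interface LongGauge {
    long getValue();
}

final class OrderStats {
    private volatile long lastOrderId;

    void onOrder(long orderId) {
        lastOrderId = orderId;
    }

    // Method accessor, as it would be invoked reflectively.
    public long getLastOrderId() {
        return lastOrderId;
    }

    // Typed accessor: the caller receives a primitive long, with no wrapper object.
    LongGauge asGauge() {
        return () -> lastOrderId;
    }

    // Reflection path: the long result is returned as a java.lang.Long.
    static Object readReflectively(OrderStats target) {
        try {
            Method accessor = OrderStats.class.getMethod("getLastOrderId");
            return accessor.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Reading through asGauge().getValue() keeps the value primitive end to end, which is the property a garbage-sensitive application is after.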

Gauge Subclass Field

You can subclass one of the XXXGauge implementations to avoid garbage associated with an annotated method. This is useful if your Gauge needs to be calculated or you are not running in a Talon Server and you need to programmatically register a Gauge instance with the AepEngine. 

Note that in the above case the 'name' attribute is omitted on the AppStat annotation because it is provided directly when creating the Gauge. 

Gauges on Server Heartbeats

Gauges can be read programmatically on server heartbeats:

yields

Invalid Order Flag: true
Last Order ID Processed: 10

Gauges in Aep Engine Stats Trace

[User Gauge Stats]
...Invalid Orders: 9
...Last Order ID Processed: 10

Threading Considerations for Gauges

Note that in the above examples, gauge fields are declared as volatile. This is because gauge values are collected by the statistics thread that is emitting server heartbeats, not the application's business logic thread.

Counters

A Counter is useful for recording a monotonically increasing value over time. Sampled periodically, it can be used to derive a rate. For example, a counter could be used to record the number of messages received; sampling it over time yields a received message rate.

If Aep engine stats tracing is enabled, the above stat will be printed along with the rest of the engine stats in the format <overallCount> <lastIntervalCount> (<overallRate> <lastIntervalRate>):

[User Counter Stats]
...Invalid Orders: 9 1 (1.01 1) 

From the above, we can see that there were 9 invalid orders in the lifetime of the app, 1 invalid order in the last interval, and that the app is receiving a little over 1 invalid order / sec. 
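The four traced values can be derived from two samples of a counter, as sketched below. This is an assumption about how such figures can be computed, not a description of the engine's implementation; CounterRates and its parameters are hypothetical, and the two-decimal formatting is chosen for the example only.

```java
import java.util.Locale;

// Hypothetical helper: derives the four traced values from two samples of a counter.
final class CounterRates {
    static String format(long previousCount, long currentCount,
                         double intervalSeconds, double totalSeconds) {
        long lastIntervalCount = currentCount - previousCount;        // delta between samples
        double overallRate = currentCount / totalSeconds;             // per second, lifetime
        double lastIntervalRate = lastIntervalCount / intervalSeconds; // per second, last interval
        return String.format(Locale.ROOT, "%d %d (%.2f %.2f)",
                currentCount, lastIntervalCount, overallRate, lastIntervalRate);
    }
}
```

For instance, a counter that moved from 8 to 9 during a 1 second interval, 8.9 seconds after the counter was created, formats as "9 1 (1.01 1.00)".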

User stats are also included in Server Heartbeats. The following code iterates through all user Counter stats and prints them out.

Invalid Orders: 9

Series

Series stats allow capture of a series of datapoints and allow reporting of histographical statistics based on that series. 

A common use case for a Series statistic is collecting latency timing data: one would like to observe the median, min, max and 99.99th percentile of message processing times to ensure that SLAs are being met. However, non-lossy collection and reporting of histographical latency statistics is a challenging problem in low latency systems due to the number of data points that need to be retained, computed and serialized.

For example, imagine an application that is recording latency statistics for messages coming in at a rate of 10k/sec. To accurately compute and report percentiles with a collection period of 10 seconds, the application needs to retain at least 100,000 data points per statistic to perform histographical analysis for just one interval. Assuming that the values are double or long values (8 bytes each), one would be looking at ~800 KB per statistic collected. Collecting and computing on such data is hard on processor memory caches and can have a disruptive impact on application processing times.

Furthermore, to perform longer term histographical analysis (across multiple collection periods) without losing any data, each set of interval results needs to be stored so that computation can be performed. Persisting such data to disk or emitting it in server heartbeats is also problematic because it leads to a large volume of data, which puts a strain on disk space and bandwidth or, in the case of heartbeats, on network bandwidth when emitted over the messaging fabric.
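The sizing estimate above (10k messages/sec over a 10 second collection period) can be checked with a few lines of arithmetic. SeriesSizing is a hypothetical helper written for this illustration only.

```java
// Back-of-the-envelope sizing for lossless capture, using the figures quoted above.
final class SeriesSizing {
    // Data points that must be retained to cover one collection interval.
    static long dataPointsPerInterval(long messagesPerSecond, long intervalSeconds) {
        return messagesPerSecond * intervalSeconds;
    }

    // Raw storage for those data points; long and double values are both 8 bytes.
    static long bytesPerInterval(long dataPoints) {
        return dataPoints * Long.BYTES;
    }
}
```

With 10,000 messages/sec and a 10 second interval this gives 100,000 data points and 800,000 bytes per statistic, before any object or serialization overhead.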

Loss-less series stats collection

The X Platform supports loss-less series capture by allowing all collected latency timing data points to be emitted in heartbeats. Provided that data points are not captured faster than they are collected, every data point can be emitted in heartbeats (which can be logged to disk or emitted over an SMA channel). However, this approach should be used sparingly as it is quite expensive.

Histogram (HDR) collection

As an alternative to reporting all captured data points, 3.1 introduces computed histogram reporting based on HDRHistogram which significantly reduces the size of heartbeats by maintaining a running computation of latency statistics. At each collection interval the captured latencies are fed into both a running histogram and an interval histogram.

 

Running Stats: These stats allow a monitoring application to connect at any time and get a view into the historical latency statistics (at least since they were last reset).

Interval Stats: These stats allow a monitoring application to get an instantaneous view of the statistic over a recent time window.
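The relationship between the two views can be sketched as follows. This is a simplified illustration: Summary tracks only count, min, max and sum rather than a real histogram, and both class names are hypothetical.

```java
// Simplified stand-in for a histogram: tracks count, min, max and sum only.
final class Summary {
    long count;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long sum;

    void record(long value) {
        count++;
        min = Math.min(min, value);
        max = Math.max(max, value);
        sum += value;
    }

    long mean() {
        return count == 0 ? 0 : sum / count;
    }
}

// Hypothetical collector: feeds each interval's values into both summaries.
final class DualSummary {
    final Summary running = new Summary(); // accumulates until reset
    Summary interval = new Summary();      // replaced on every collection

    // Called once per collection interval with the values captured since the last call.
    void collect(long[] captured) {
        interval = new Summary();
        for (long value : captured) {
            interval.record(value);
            running.record(value);
        }
    }
}
```

After each call to collect, interval describes only the most recent window while running keeps describing everything recorded so far.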

 

An HDRHistogram compromises on precision of the captured latencies in favor of cheaper computation and storage of results while still maintaining a predictable precision. The documentation on HDR histogram provides details on the level of precision that is achieved. Practically speaking, however, for latency data points in the 100s of microseconds the precision that is guaranteed for collected percentiles is in the order of +/- 1us, which is acceptable for most applications (for tail values, say in the range of 1 minute, the value is guaranteed to be correct within +/- 60ms).
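The two figures quoted above are consistent with a relative precision of about 0.1%, that is, three significant decimal digits. That configuration is an assumption inferred from the numbers, not something stated here, and HistogramPrecision is a hypothetical helper that only performs the arithmetic.

```java
// Worst-case error for a value when precision is expressed in significant decimal digits.
final class HistogramPrecision {
    // Assumption: three significant digits, so the error is bounded by value / 10^3.
    static final int SIGNIFICANT_DIGITS = 3;

    static double worstCaseError(double value) {
        return value / Math.pow(10, SIGNIFICANT_DIGITS);
    }
}
```

With values expressed in microseconds, a 500 us data point has a bound of 0.5 us, and a 1 minute data point (60,000,000 us) has a bound of 60,000 us, or 60 ms.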

Creating a Series Stat

When Aep Engine Statistics are enabled, the statistic would then be traced:

[App (myapp) User Stats]
...Series{
......[New Customer Age(sno=8, #points=1, #skipped=0)
.........New Customer Age(interval): [sample=1, min=21 max=21 mean=21 median=21 75%ile=21 90%ile=21 99%ile=21 99.9%ile=21 99.99%ile=21]
.........New Customer Age (running): [sample=8, min=21 max=29 mean=23 median=21 75%ile=22 90%ile=23 99%ile=29 99.9%ile=29 99.99%ile=29]
...}

In the above we can see that in the last interval, one new customer registered and their age was 21. Over the last 8 intervals, the average new customer age is 23 with the oldest being 29 and the youngest being 21. 

Series Data in Server Heartbeats

Series data for user stats are exposed in the Server Monitoring Heartbeat using the SrvMonUserSeriesStat object:

SrvMonUserSeriesStat

Reports an application defined series statistic.

Field Name (type), followed by its description:

name (String)

When the server is configured to include the captured data points for the statistic, the returned array will include the values collected during this interval. This allows monitoring tools to perform non-lossy calculation of percentiles, provided that no data points were skipped due to undersampling or a missed heartbeat.

The number of valid values in the returned array is dictated by numDataPoints; if the length of the values array is longer than numDataPoints, subsequent values in the array should be ignored.

seriesType (SrvMonSeriesType)

The type of the series data.

Currently only Integer Data series are supported. The types BYTE, SHORT, LONG, FLOAT and DOUBLE are reserved for future use. Processors of heartbeats should ensure that they check the data type here for future proofing.

intSeries (SrvMonIntSeries)

The collected int series data for an INT series.

This field should only be set when the series type is set to SrvMonSeriesType.INT.
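A heartbeat processor can guard against the reserved types with a simple check before decoding. The enum below is a local stand-in that mirrors the type names mentioned above so the sketch compiles on its own; it is not the platform's SrvMonSeriesType, and SeriesTypeCheck is a hypothetical name.

```java
// Local stand-in mirroring the series types named above; not the platform's enum.
enum SeriesType { INT, BYTE, SHORT, LONG, FLOAT, DOUBLE }

final class SeriesTypeCheck {
    // Returns true only for the type a consumer knows how to decode today.
    static boolean canDecode(SeriesType type) {
        switch (type) {
            case INT:
                return true;  // the int series payload is expected to be set
            default:
                return false; // reserved types: skip rather than misread the payload
        }
    }
}
```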

SrvMonIntSeries

Latency statistics are reported in a SrvMonIntSeries object. 

SrvMonIntSeries reports interval and running histogram data for a series of integer data points. It may also be used to report the captured datapoints, but because reporting the raw data is costly (both in terms of collection and size/bandwidth), the captured values are typically not reported.

SrvMonIntSeries is frequently used to capture measured latency timings, but can also be used to capture any integer data series.

Field Name (type), followed by its description:

dataPoints (int[])

When the server is configured to include the captured data points for the statistic, the returned array will include the values collected during this interval. This allows monitoring tools to perform non-lossy calculation of percentiles, provided that no data points were skipped due to undersampling or a missed heartbeat.

The number of valid values in the returned array is dictated by numDataPoints; if the length of the values array is longer than numDataPoints, subsequent values in the array should be ignored.

lastSequenceNumber (long)

Sequence numbers for collected data points start at 1, a value of 0 indicates that no data points have been collected.

The Sequence Number always indicates the number of data points that have been collected since the statistic was created or last reset.
If the statistic is reset then this value will reset to 0, when

numDataPoints (int)

Indicates the number of data points collected in this interval. If no data points were collected, numDataPoints will be 0. 

The sequence number of the first value collected in this interval can be determined by subtracting numDataPoints from lastSequenceNumber. This can be used to determine whether data points were skipped between two consecutive heartbeats due to undersampling or a missed heartbeat.

skippedDataPoints (long)

The runtime only holds on to a fixed number of data points for any particular Latency statistic. If the sampling interval is too high, then some datapoints may be skipped. For example, let's say Latency stats are configured to hold on to a sample size of 1000 datapoints. If the number of data points being captured per second is 2000, and the stats collection interval is 1 second, then on each collection, 1000 datapoints will be missed, which will skew results. 

The skipped data points counter thus indicates how many data points have been missed in the reported runningStats. If the count grows between two successive heartbeats, the values in intervalStats don't reflect all the activity since the last interval.

The skipped data points counter is a running counter: it tracks the total number of data points that have been skipped since the underlying statistic was last reset.

intervalStats (SrvMonIntHistogram) NEW IN 3.1

Holds computed results for the datapoints captured for this heartbeat (i.e. for the numDataPoints captured).

This field may not be set if numDataPoints is 0 or if interval computations are not done on the server.

runningStats (SrvMonIntHistogram) NEW IN 3.1

Holds computed results for the datapoints over the lifetime of this statistic (e.g. since seqNo 1).

If the underlying statistic is reset then the running stats are correspondingly reset.
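The bookkeeping described for dataPoints, lastSequenceNumber, numDataPoints and skippedDataPoints can be sketched from a monitoring tool's point of view. IntSeriesSample is a hypothetical plain-Java holder written for this illustration, not the platform's heartbeat class. The gap calculation follows the subtraction rule given above and assumes that sequence numbers keep advancing for data points that were never reported.

```java
import java.util.Arrays;

// Hypothetical plain-Java view of the fields described above; not the platform's class.
final class IntSeriesSample {
    final int[] dataPoints;
    final long lastSequenceNumber;
    final int numDataPoints;
    final long skippedDataPoints;

    IntSeriesSample(int[] dataPoints, long lastSequenceNumber,
                    int numDataPoints, long skippedDataPoints) {
        this.dataPoints = dataPoints;
        this.lastSequenceNumber = lastSequenceNumber;
        this.numDataPoints = numDataPoints;
        this.skippedDataPoints = skippedDataPoints;
    }

    // Only the first numDataPoints entries are meaningful; the rest are ignored.
    int[] validDataPoints() {
        return Arrays.copyOf(dataPoints, numDataPoints);
    }

    // Data points missing between the previous heartbeat and this one (0 means no gap).
    long gapSince(IntSeriesSample previous) {
        long sequenceBeforeThisInterval = lastSequenceNumber - numDataPoints;
        return sequenceBeforeThisInterval - previous.lastSequenceNumber;
    }

    // Growth of the running skipped counter since the previous heartbeat.
    long newlySkippedSince(IntSeriesSample previous) {
        return skippedDataPoints - previous.skippedDataPoints;
    }

    // Data points lost per collection when the capture rate outruns the sample capacity.
    static long skippedPerInterval(long capacity, long pointsPerSecond, long intervalSeconds) {
        return Math.max(0, pointsPerSecond * intervalSeconds - capacity);
    }
}
```

With a capacity of 1000, a capture rate of 2000 data points per second and a 1 second collection interval, skippedPerInterval returns 1000, matching the example given for skippedDataPoints.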

SrvMonIntHistogram 

Holds calculated statistics of a range of integer datapoints. The values are computed using an HDRHistogram.

Field Name (type), followed by its description:

sampleSize (long)

The number of datapoints over which results were calculated (possibly 0 if no data points were collected).

minimum (int)

The minimum value recorded in the sample set.

The value is not set if the sample size is 0.

maximum (int)

The maximum value recorded in the sample set.

The value is not set if the sample size is 0.

mean (int)

The mean for the values recorded in the sample set.

The value is not set if the sample size is 0.

median (int)

The median for the values recorded in the sample set.

The value is not set if the sample size is 0.

pct75 (int)

The 75th percentile for the values recorded in the sample set.

The value is not set if the sample size is 0.

pct90 (int)

The 90th percentile for the values recorded in the sample set.

The value is not set if the sample size is 0.

pct99 (int)

The 99th percentile for the values recorded in the sample set.

The value is not set if the sample size is 0.

pct999 (int)

The 99.9th percentile for the values recorded in the sample set.

The value is not set if the sample size is 0.

pct9999 (int)

The 99.99th percentile for the values recorded in the sample set.

The value is not set if the sample size is 0.

samplesOverMax (long)

The number of samples that exceeded the maximum recordable value for the histogram.

When computing latency percentiles using an HDRHistogram, it is possible that a recorded value will exceed the maximum value allowable. In this case, the datapoint is downsampled to the maximum recordable value, which skews the percentile calculations lower. SamplesOverMax allows detection of how frequently this is occurring.

samplesUnderMin (long)

The number of samples captured that were below the recordable value for the histogram.

When computing latency percentiles using an HDRHistogram, it is possible that a recorded value will be below 0 in cases where clock skew is possible. In such cases, the value will be upsampled to 0, which can skew the histogram results. SamplesUnderMin allows detection of how frequently this is happening.
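The clamping behavior described for samplesOverMax and samplesUnderMin can be sketched as follows. ClampingRecorder is a hypothetical class that illustrates the counting logic only; the actual maximum recordable value and the way the platform records into its histogram are not specified here.

```java
// Hypothetical recorder showing how out-of-range values could be clamped and counted.
final class ClampingRecorder {
    private final int maxRecordable;
    long samplesOverMax;
    long samplesUnderMin;

    ClampingRecorder(int maxRecordable) {
        this.maxRecordable = maxRecordable;
    }

    // Returns the value that would actually be fed into the histogram.
    int clamp(int value) {
        if (value > maxRecordable) {
            samplesOverMax++;      // too large: recorded as the maximum
            return maxRecordable;
        }
        if (value < 0) {
            samplesUnderMin++;     // negative (for example from clock skew): recorded as 0
            return 0;
        }
        return value;
    }
}
```

A monitoring tool that sees either counter growing can treat the reported percentiles as skewed toward the clamped bound.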

See also: ISrvMonUserSeriesStat, ISrvMonIntSeries and ISrvMonIntHistogram for details on the values they report.

Latencies

The Latencies stat is an extension of the Series stat. When capturing latency or timing data, it is good practice to use Latencies instead of Series.

User Defined Statistic Discovery

@AppStat Annotation

The AppStat annotation can be used to annotate user defined statistics in the application to allow those statistics to be discovered by a Talon Server. The Talon server will register each statistic it finds with the application's AepEngine. AppStat annotations are only introspected once: just after the application's AepEngine is injected. If the application changes the instance after application initialization, the new stat instance won't be discovered by the application. 

@AppStatContainersAccessor

Any @AppStat annotated field in the main application class will be discovered by the Talon server. If additional classes in your application contain user defined stats, they can be exposed to the server using the AppStatContainerAccessor annotation.

AppStat Discovery in Hornet

For Topic Oriented Applications, any @Managed object will be introspected for User Defined stats. See ManagedObjectLocator. The DefaultManagedObjectLocator for Hornet calls TopicOrientedApplication.addAppStatContainers(Set), so unless your application provides its own managed object locator, additional user defined stats containers can be added by overriding addAppStatsContainers:

Programmatically Registering Stats

When running in a Talon Server, the server registers discovered App Stats with the AepEngine. When not running in a Talon Server, user defined stats may be registered programmatically with the AepEngine by calling the appropriate register method:

Type: Registration method

  • Counter: registerCounterStat(IStats.Counter counter)
  • Gauge: registerGaugeStat(IStats.Gauge gauge)
  • Series, Latencies: registerSeriesStat(IStats.Series series)

If not registered with the engine, app stats will not be collected with other engine stats when engine stats are enabled. 

Registration of User Defined stats is only supported prior to engine startup.