The platform lets users define their own application-specific stats. These user-defined app stats can be registered with the AepEngine, which allows them to be traced along with Aep Engine Stats and included in server heartbeats. Applications can register stats with the AepEngine programmatically or, when running in a Talon server, have them discovered via annotations.
This article describes the usage of the following types of statistics that applications can expose:
Stat Type | Description |
---|---|
Gauge | A Gauge samples and reports a value on each stats collection interval. Gauges can be exposed simply by annotating a field or method of interest. |
Counter | A Counter stat captures a monotonically increasing value, and can be used to derive rates based on deltas between intervals. |
Series | A Series stat records a series of data points over which histogram computations can be reported. |
Latencies | A special case of a Series stat used to collect timing-related data points. Latencies are used extensively within Core X to provide visibility into transaction processing pipeline times. |
A Gauge captures an instantaneous value at the time of a statistic collection. Gauges can be of the following types:
Gauge values are collected on one or more stats collection threads, separate from the business logic thread. Consequently:
Additionally for method gauges:
When running in a Talon Server, it is possible to annotate a field as a gauge:

    import com.neeve.stats.*;

    public class MyApp {
        // Sampled by the stats thread, so the field is declared volatile.
        @AppStat(name = "Last Order ID Processed")
        private volatile int lastOrderNumber = -1;

        @EventHandler
        public void onNewOrder(NewOrderMessage message) {
            lastOrderNumber = message.getOrderId();
        }
    }

When running in a Talon Server, it is possible to annotate a method as a gauge accessor:

    import com.neeve.stats.*;

    public class MyApp {
        private volatile int numInvalidOrders;

        @EventHandler
        public void onNewOrder(NewOrderMessage message) {
            if (message.getQuantity() < 0) {
                // Only the business logic thread writes this field.
                numInvalidOrders++;
            }
        }

        @AppStat(name = "Invalid Order Flag")
        public boolean getHasOrderErrors() {
            return numInvalidOrders > 0;
        }
    }

Method accessor gauges for primitive types are not zero garbage. This is because the platform invokes getHasOrderErrors via reflection, which generates autoboxing garbage. If your application is sensitive to garbage, a better approach is to use a Gauge subclass, which can return the primitive type directly.
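The boxing cost described above can be seen with plain Java reflection. The following is a minimal sketch, not platform code: `OrderStatsSketch`, `PrimitiveLongGaugeSketch`, and `GaugeBoxingSketch` are hypothetical names introduced for illustration. It shows that `Method.invoke` always returns an `Object`, so a primitive result has to be wrapped, while a typed accessor hands back the primitive itself.

```java
import java.lang.reflect.Method;

/** Hypothetical application class exposing a value that a gauge would report. */
final class OrderStatsSketch {
    private volatile long lastOrderId = 1_000_000L;

    public long getLastOrderId() {
        return lastOrderId;
    }
}

/** Hypothetical typed gauge: subclasses return the value as a primitive. */
abstract class PrimitiveLongGaugeSketch {
    abstract long getLongValue();
}

final class GaugeBoxingSketch {
    /** Reflective read: Method.invoke returns Object, so the long comes back as a Long. */
    static Object readReflectively(OrderStatsSketch stats) {
        try {
            Method accessor = OrderStatsSketch.class.getMethod("getLastOrderId");
            return accessor.invoke(stats);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Typed read: the value is returned as a primitive long, with no wrapper. */
    static long readDirectly(PrimitiveLongGaugeSketch gauge) {
        return gauge.getLongValue();
    }

    public static void main(String[] args) {
        OrderStatsSketch stats = new OrderStatsSketch();
        PrimitiveLongGaugeSketch gauge = new PrimitiveLongGaugeSketch() {
            @Override
            long getLongValue() {
                return stats.getLastOrderId();
            }
        };
        System.out.println(readReflectively(stats).getClass().getSimpleName()); // Long
        System.out.println(readDirectly(gauge)); // 1000000
    }
}
```

In the sketch the gauge instance is created once and reused, so repeated reads through it allocate nothing, whereas each reflective read of a large value produces a new wrapper object.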
You can subclass one of the XXXGauge implementations to avoid the garbage associated with annotating a method. This is useful if your Gauge needs to be calculated, or if you are not running in a Talon Server and need to programmatically register a Gauge instance with the AepEngine.

    import com.neeve.stats.*;

    public class MyApp {
        private volatile int numInvalidOrders;

        // The name is passed to the gauge itself, so @AppStat needs no 'name' attribute.
        @AppStat
        private final Gauge orderErrorsGauge = new BooleanGauge("Invalid Order Flag") {
            public boolean getBooleanValue() {
                return numInvalidOrders > 0;
            }
        };

        @EventHandler
        public void onNewOrder(NewOrderMessage message) {
            if (message.getQuantity() < 0) {
                numInvalidOrders++;
            }
        }
    }

Note that in the above case the 'name' attribute is omitted on the AppStat annotation because it is provided directly when creating the Gauge.
Gauges can be read programmatically on server heartbeats:

    public class MyStatsListener {
        @EventHandler
        public void onHeartbeat(SrvMonHeartbeatMessage message) {
            for (SrvMonAppStats appStats : message.getAppsStatsEmptyIfNull()) {
                for (SrvMonUserGaugeStat gauge : appStats.getUserStats().getGaugesEmptyIfNull()) {
                    System.out.println(gauge.getName() + ": " + SrvMonUtil.getGaugeValue(gauge) + " (" + gauge.getGaugeType() + ")");
                }
            }
        }
    }

yields

    Invalid Order Flag: true
    Last Order ID Processed: 10

    [User Gauge Stats]
    ...Invalid Orders: 9
    ...Last Order ID Processed: 10

Note that in the above examples the gauge fields are declared volatile. This is because gauge values are collected by the statistics thread that emits server heartbeats, not by the application's business logic thread.
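The visibility requirement can be sketched in plain Java. In the example below, `VolatileGaugeSketch` and its methods are hypothetical and stand in for an application and a stats collector; nothing here is platform API. One thread plays the business logic thread and writes the field, while the calling thread plays the stats thread and samples it until it observes the final value. Because the field is volatile, the sampling loop is guaranteed to see the last write; without volatile, that guarantee would be lost.

```java
final class VolatileGaugeSketch {
    // Written by the business logic thread, read by the sampling thread.
    private volatile int lastOrderNumber = -1;

    void onNewOrder(int orderId) {
        lastOrderNumber = orderId;
    }

    int sample() {
        return lastOrderNumber;
    }

    /**
     * Starts a writer thread that publishes order ids 0..finalOrderId and
     * samples the gauge on the calling thread until the final id is visible.
     */
    static int sampleUntil(int finalOrderId) {
        VolatileGaugeSketch app = new VolatileGaugeSketch();
        Thread business = new Thread(() -> {
            for (int id = 0; id <= finalOrderId; id++) {
                app.onNewOrder(id);
            }
        }, "business-logic");
        business.start();

        int observed;
        do {
            observed = app.sample();
        } while (observed != finalOrderId);
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(sampleUntil(100_000)); // 100000
    }
}
```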
A Counter is useful for recording a monotonically increasing value over time. Sampled periodically, it can be used to derive a rate. For example, a counter could record the number of messages received; sampling it over time yields a received-message rate.

    import com.neeve.stats.IStats.Counter;
    import com.neeve.stats.StatsFactory;

    public class MyApp {
        @AppStat
        private final Counter numInvalidOrders = StatsFactory.createCounterStat("Invalid Orders");

        @EventHandler
        public void onNewOrder(NewOrderMessage message) {
            if (message.getQuantity() < 0) {
                numInvalidOrders.increment();
            }
        }
    }

If Aep engine stats tracing is enabled, the above stat will be printed along with the rest of engine stats in the format
<overallCount> <lastIntervalCount> (<overallRate> <lastIntervalRate>):

    [User Counter Stats]
    ...Invalid Orders: 9 1 (1.01 1)

So from the above we can see that there were 9 invalid orders in the lifetime of the app, 1 invalid order in the last interval, and that the app is receiving a little over 1 invalid order / sec.
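How the four traced numbers relate to periodic samples of a counter can be sketched as follows. This is an illustration only: `CounterRateSketch` is a hypothetical class, and computing each rate as a count divided by elapsed seconds is an assumption, since the article does not spell out the platform's exact formula.

```java
final class CounterRateSketch {
    private final long startNanos;
    private long lastCount;
    private long lastNanos;

    CounterRateSketch(long startNanos) {
        this.startNanos = startNanos;
        this.lastNanos = startNanos;
    }

    /**
     * Takes the counter's current value and the current time and returns
     * {overallCount, lastIntervalCount, overallRate, lastIntervalRate}.
     * Rates are per second; nowNanos must be later than the previous sample.
     */
    double[] sample(long count, long nowNanos) {
        long intervalCount = count - lastCount;
        double overallSeconds = (nowNanos - startNanos) / 1e9;
        double intervalSeconds = (nowNanos - lastNanos) / 1e9;
        lastCount = count;
        lastNanos = nowNanos;
        return new double[] {
            count, intervalCount, count / overallSeconds, intervalCount / intervalSeconds
        };
    }

    public static void main(String[] args) {
        CounterRateSketch sampler = new CounterRateSketch(0L);
        sampler.sample(8, 8_000_000_000L);               // 8 events in the first 8 seconds
        double[] s = sampler.sample(9, 9_000_000_000L);  // 1 more event in the next second
        System.out.println((long) s[0] + " " + (long) s[1] + " (" + s[2] + " " + s[3] + ")");
    }
}
```

Because the counter only ever increases, the difference between two consecutive samples is the number of events in that interval, which is what makes the interval rate derivable without storing individual events.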
User stats are also included in server heartbeats. The following code iterates through all user Counter stats and prints them:

    public class MyStatsListener {
        @EventHandler
        public void onHeartbeat(SrvMonHeartbeatMessage message) {
            for (SrvMonAppStats appStats : message.getAppsStatsEmptyIfNull()) {
                for (SrvMonUserCounterStat counter : appStats.getUserStats().getCounters()) {
                    System.out.println(counter.getName() + ": " + counter.getCount());
                }
            }
        }
    }

Invalid Orders: 9 |
Series stats capture a series of data points and report histogram statistics computed over that series.
A common use case for a Series statistic is collecting latency timing data: one would like to observe the median, minimum, maximum, and 99.99th percentile of message processing times to ensure that SLAs are being met. However, non-lossy collection and reporting of latency histogram statistics is a challenging problem in low-latency systems because of the number of data points that need to be retained, computed, and serialized. For example, imagine an application that records latency statistics for messages coming in at a rate of 10k/sec. To accurately compute and report percentiles with a collection period of 10 seconds, the application needs to retain at least 100,000 data points per statistic to perform histogram analysis for just one interval. Assuming that the values are double or long values, that is roughly 800 KB per statistic collected. Collecting and computing on such data is hard on processor memory caches and can have a disruptive impact on application processing times.
Furthermore, performing longer-term histogram analysis (across multiple collection periods) without losing any data requires storing each set of interval results so that the computation can be performed. Persisting such data to disk or emitting it in server heartbeats is also problematic because it produces a large volume of data, which strains disk space and I/O bandwidth or, in the case of heartbeats emitted over the messaging fabric, network bandwidth.
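The sizing argument above is simple arithmetic, shown here as a small sketch. `SeriesRetentionEstimate` is a hypothetical helper, not part of the platform; the inputs are the figures from the example (10k messages per second, a 10 second collection period, 8-byte values).

```java
final class SeriesRetentionEstimate {
    /** Number of data points that accumulate in one collection interval. */
    static long dataPointsPerInterval(long messagesPerSecond, long intervalSeconds) {
        return messagesPerSecond * intervalSeconds;
    }

    /** Raw storage needed to retain that many fixed-size values. */
    static long bytesPerInterval(long dataPoints, int bytesPerDataPoint) {
        return dataPoints * bytesPerDataPoint;
    }

    public static void main(String[] args) {
        long points = dataPointsPerInterval(10_000, 10);
        long bytes = bytesPerInterval(points, Long.BYTES); // a long or double occupies 8 bytes
        System.out.println(points + " data points, " + bytes + " bytes");
        // prints "100000 data points, 800000 bytes"
    }
}
```

The estimate covers only the raw values for a single statistic and a single interval; retaining several intervals or several statistics multiplies it accordingly.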
The X Platform supports loss-less series capture by allowing all collected latency timing data points to be emitted in heartbeats. Provided that data points are not captured faster than they are collected (see skippedDataPoints below), every data point can be emitted in heartbeats (which can be logged to disk or emitted over an SMA channel). However, this approach should be used sparingly, as it is quite expensive.
As an alternative to reporting all captured data points, 3.1 introduces computed histogram reporting based on HDRHistogram, which significantly reduces the size of heartbeats by maintaining a running computation of latency statistics. At each collection interval, the captured latencies are fed into both a running histogram and an interval histogram.
Stat | Description |
---|---|
Running Stats | These stats allow a monitoring application to connect at any time and get a view into the historical latency statistics (at least since they were last reset). |
Interval Stats | These stats allow a monitoring application to get an instantaneous view of the statistic over a recent time window. |
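The relationship between interval and running results can be sketched without a histogram library. The example below is a simplification under stated assumptions: `IntervalRunningSketch` and its nested `Summary` are hypothetical, and an exact count/min/max/mean summary is used in place of an HDRHistogram purely for brevity. On each collection, the values captured since the previous collection are fed into a fresh interval summary and into a long-lived running summary.

```java
import java.util.ArrayList;
import java.util.List;

final class IntervalRunningSketch {
    /** Exact summary of recorded values; a stand-in for a histogram. */
    static final class Summary {
        long samples;
        long sum;
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;

        void record(int value) {
            samples++;
            sum += value;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        /** Integer mean, or 0 when nothing has been recorded. */
        long mean() {
            return samples == 0 ? 0 : sum / samples;
        }
    }

    private final List<Integer> captured = new ArrayList<>();
    private final Summary running = new Summary();

    /** Called by the application for each data point. */
    void add(int value) {
        captured.add(value);
    }

    /** Called once per collection interval; returns the interval summary. */
    Summary collect() {
        Summary interval = new Summary();
        for (int value : captured) {
            interval.record(value);
            running.record(value);
        }
        captured.clear();
        return interval;
    }

    /** Summary over every value collected so far. */
    Summary running() {
        return running;
    }

    public static void main(String[] args) {
        IntervalRunningSketch series = new IntervalRunningSketch();
        series.add(10);
        series.add(20);
        series.add(30);
        series.collect();
        series.add(40);
        Summary interval = series.collect();
        System.out.println("interval mean=" + interval.mean() + " running mean=" + series.running().mean());
        // prints "interval mean=40 running mean=25"
    }
}
```

A monitoring tool that connects late still sees the full history through the running summary, while the interval summary reflects only the most recent window.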
An HDRHistogram compromises on precision of the captured latencies in favor of cheaper computation and storage of results, while still maintaining a predictable precision. The HDRHistogram documentation provides details on the level of precision that is achieved. Practically speaking, however, for latency data points in the hundreds of microseconds, the precision guaranteed for collected percentiles is on the order of +/- 1us, which is acceptable for most applications (for tail values in the range of, say, 1 minute, the value is guaranteed to be correct within +/- 60ms).
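The two precision figures above are consistent with a histogram that preserves about three significant digits, that is, a relative error of roughly 0.1%. The digit count is an assumption made for this sketch; the article does not state how the platform configures its histograms, and `PrecisionEstimate` is a hypothetical helper.

```java
final class PrecisionEstimate {
    /**
     * Approximate worst-case error, in microseconds, for a recorded value when
     * the histogram preserves the given number of significant decimal digits.
     */
    static long worstCaseErrorMicros(long valueMicros, int significantDigits) {
        long divisor = 1;
        for (int i = 0; i < significantDigits; i++) {
            divisor *= 10;
        }
        return valueMicros / divisor;
    }

    public static void main(String[] args) {
        // A value of 1,000 us (1 ms) is resolved to within about 1 us.
        System.out.println(worstCaseErrorMicros(1_000L, 3)); // 1
        // A value of 60,000,000 us (1 minute) is resolved to within about 60,000 us (60 ms).
        System.out.println(worstCaseErrorMicros(60_000_000L, 3)); // 60000
    }
}
```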

    import com.neeve.stats.IStats.Series;
    import com.neeve.stats.StatsFactory;

    public class MyApp {
        @AppStat
        private final Series newCustomerAge = StatsFactory.createSeriesStat("New Customer Age");

        @EventHandler
        public void onNewCustomer(NewCustomerCreation message) {
            newCustomerAge.add(message.getQuantity());
        }
    }

When Aep Engine Statistics are enabled, the statistic would then be traced:

    [App (myapp) User Stats]
    ...Series{
    ......[New Customer Age(sno=8, #points=1, #skipped=0)
    .........New Customer Age(interval): [sample=1, min=21 max=21 mean=21 median=21 75%ile=21 90%ile=21 99%ile=21 99.9%ile=21 99.99%ile=21]
    .........New Customer Age (running): [sample=8, min=21 max=29 mean=23 median=21 75%ile=22 90%ile=23 99%ile=29 99.9%ile=29 99.99%ile=29]
    ...}

In the above we can see that in the last interval one new customer registered, and their age was 21. Across the 8 data points recorded so far (the running stats), the average new customer age is 23, with the oldest being 29 and the youngest being 21.
Series data for user stats is exposed in the Server Monitoring Heartbeat using the SrvMonUserSeriesStat object:

    public class MyStatsListener {
        @EventHandler
        public void onHeartbeat(SrvMonHeartbeatMessage message) {
            for (SrvMonAppStats appStats : message.getAppsStatsEmptyIfNull()) {
                for (SrvMonUserSeriesStat series : appStats.getUserStats().getSeries()) {
                    System.out.println(series.getName() + ": mean: " + series.getIntSeries().getRunningStats().getMean());
                }
            }
        }
    }

Reports an application defined series statistic.
Field Name | type | Description |
---|---|---|
name | String | The name of the series statistic. |
seriesType | SrvMonSeriesType | The type of the series data. |
intSeries | SrvMonIntSeries | The collected int series data for an INT series. This field should only be set when the series type is set to SrvMonSeriesType.INT. |
Latency statistics are reported in a SrvMonIntSeries object.
SrvMonIntSeries reports interval and running histogram data for a series of integer data points. It may also be used to report the captured datapoints, but because reporting the raw data is costly (both in terms of collection and size/bandwidth), the captured values are typically not reported.
SrvMonIntSeries is frequently used to capture measured latency timings, but can also be used to capture any integer data series.
Field Name | type | Description |
---|---|---|
dataPoints | int[] | When the server is configured to include the captured data points for the statistic, the returned array will include the values collected during this interval. This allows monitoring tools to perform non-lossy calculation of percentiles, provided no data points were skipped due to undersampling or a missed heartbeat. |
lastSequenceNumber | long | Sequence numbers for collected data points start at 1; a value of 0 indicates that no data points have been collected. The sequence number always indicates the number of data points that have been collected since the statistic was created or last reset. |
numDataPoints | int | Indicates the number of data points collected in this interval. If no data points were collected, numDataPoints will be 0. |
skippedDataPoints | long | The runtime only holds on to a fixed number of data points for any particular Latency statistic. If the stats collection interval is too long for the capture rate, some data points may be skipped. For example, say Latency stats are configured to hold a sample size of 1000 data points. If 2000 data points are captured per second and the stats collection interval is 1 second, then 1000 data points will be missed on each collection, which will skew results. |
intervalStats | SrvMonIntHistogram | Holds computed results for the data points captured for this heartbeat (i.e. for the numDataPoints captured). This field may not be set if numDataPoints is 0 or if interval computations are not done on the server. New in 3.1. |
runningStats | SrvMonIntHistogram | Holds computed results for the data points over the lifetime of this statistic (i.e. since seqNo 1). If the underlying statistic is reset, the running stats are correspondingly reset. |
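The skippedDataPoints example in the table can be reproduced with a small fixed-capacity buffer. This is a sketch under assumptions, not the platform's implementation: `BoundedSeriesSketch` is hypothetical, it overwrites the oldest retained value when full, and it reports the skipped count per collection. Which values the real runtime drops, and whether its count is per interval or cumulative, is not stated in this article.

```java
final class BoundedSeriesSketch {
    private final int[] buffer;
    private long sequenceNumber;         // total data points ever added
    private long lastCollectedSequence;  // sequence number at the previous collection

    BoundedSeriesSketch(int capacity) {
        this.buffer = new int[capacity];
    }

    /** Records a data point, overwriting the oldest retained value when full. */
    void add(int value) {
        buffer[(int) (sequenceNumber % buffer.length)] = value;
        sequenceNumber++;
    }

    /** Returns {lastSequenceNumber, numDataPoints, skippedDataPoints} for this interval. */
    long[] collect() {
        long added = sequenceNumber - lastCollectedSequence;
        long retained = Math.min(added, buffer.length);
        long skipped = added - retained;
        lastCollectedSequence = sequenceNumber;
        return new long[] { sequenceNumber, retained, skipped };
    }

    public static void main(String[] args) {
        BoundedSeriesSketch series = new BoundedSeriesSketch(1000);
        for (int i = 0; i < 2000; i++) {
            series.add(i);
        }
        long[] result = series.collect();
        System.out.println(result[1] + " retained, " + result[2] + " skipped");
        // prints "1000 retained, 1000 skipped"
    }
}
```

Either a larger sample size or a shorter collection interval brings the skipped count back to zero in this model.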
Holds calculated statistics of a range of integer datapoints. The values are computed using an HDRHistogram.
Field Name | type | Description |
---|---|---|
sampleSize | long | The number of datapoints over which results were calculated (possibly 0 if no data points were collected). |
minimum | int | The minimum value recorded in the sample set. The value is not set if the sample size is 0. |
maximum | int | The maximum value recorded in the sample set. The value is not set if the sample size is 0. |
mean | int | The mean for the values recorded in the sample set. The value is not set if the sample size is 0. |
median | int | The median for the values recorded in the sample set. The value is not set if the sample size is 0. |
pct75 | int | The 75th percentile for the values recorded in the sample set. The value is not set if the sample size is 0. |
pct90 | int | The 90th percentile for the values recorded in the sample set. The value is not set if the sample size is 0. |
pct99 | int | The 99th percentile for the values recorded in the sample set. The value is not set if the sample size is 0. |
pct999 | int | The 99.9th percentile for the values recorded in the sample set. The value is not set if the sample size is 0. |
pct9999 | int | The 99.99th percentile for the values recorded in the sample set. The value is not set if the sample size is 0. |
samplesOverMax | long | The number of samples that exceeded the maximum recordable value for the histogram. When computing latency percentiles using an HDRHistogram, it is possible that a recorded value will exceed the maximum value allowable. In this case the data point is downsampled to the maximum recordable value, which skews the percentile calculations lower. SamplesOverMax allows detection of how frequently this is occurring. |
samplesUnderMin | long | The number of samples captured that were below the minimum recordable value for the histogram. When computing latency percentiles using an HDRHistogram, it is possible that a recorded value will be below 0 in cases where clock skew is possible. In such cases the value will be upsampled to 0, which can skew the histogram results. SamplesUnderMin allows detection of how frequently this is happening. |
See also: ISrvMonUserSeriesStat, ISrvMonIntSeries and ISrvMonIntHistogram for details on the values they report. |
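The clamping behavior behind samplesOverMax and samplesUnderMin can be sketched in a few lines. `ClampingRecorderSketch` is a hypothetical class, and the maximum of 1000 used in the example is arbitrary; the sketch only illustrates how out-of-range values are pulled to the nearest recordable bound and counted.

```java
final class ClampingRecorderSketch {
    private final int maxRecordable;
    private long samplesOverMax;
    private long samplesUnderMin;

    ClampingRecorderSketch(int maxRecordable) {
        this.maxRecordable = maxRecordable;
    }

    /** Returns the value that would actually be recorded, counting any clamping. */
    int record(int value) {
        if (value < 0) {
            samplesUnderMin++;
            return 0;              // upsampled to 0
        }
        if (value > maxRecordable) {
            samplesOverMax++;
            return maxRecordable;  // downsampled to the maximum
        }
        return value;
    }

    long samplesOverMax() {
        return samplesOverMax;
    }

    long samplesUnderMin() {
        return samplesUnderMin;
    }

    public static void main(String[] args) {
        ClampingRecorderSketch recorder = new ClampingRecorderSketch(1000);
        System.out.println(recorder.record(-5));   // 0
        System.out.println(recorder.record(2000)); // 1000
        System.out.println(recorder.record(500));  // 500
    }
}
```

A non-zero samplesOverMax means the reported upper percentiles understate the true values, so the two counters are worth checking before trusting tail figures.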
The Latencies stat is an extension of the Series stat. When capturing latency or timing data, it is good practice to use Latencies instead of Series.

    import com.neeve.stats.IStats.Latencies;
    import com.neeve.stats.StatsFactory;

    public class MyApp {
        @AppStat
        private final Latencies orderPrepTime = StatsFactory.createLatencyStat("Order Prep Times");

        @EventHandler
        public void onNewOrder(NewOrderMessage message) {
            long receiveTs = UtlTime.now();
            // Do some stuff ...
            // Prepped ... capture prep time.
            orderPrepTime.add(UtlTime.now() - receiveTs);
        }
    }

The AppStat annotation can be used to annotate user-defined statistics in the application so that they can be discovered by a Talon Server. The Talon server registers each statistic it finds with the application's AepEngine. AppStat annotations are only introspected once: just after the application's AepEngine is injected. If the application replaces a stat instance after application initialization, the new stat instance won't be discovered.
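Why a replaced instance goes unnoticed can be seen in a generic reflection sketch. Everything below is hypothetical and independent of the platform: `@SketchStat` stands in for an annotation such as AppStat, and `StatDiscoverySketch.discover` stands in for the one-time introspection. The scan captures the object each annotated field refers to at that moment, so a later reassignment of the field leaves the registry pointing at the old instance.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical marker annotation; retained at runtime so reflection can see it. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SketchStat {
}

/** Hypothetical stat type. */
final class SketchCounter {
    final String name;

    SketchCounter(String name) {
        this.name = name;
    }
}

/** Hypothetical application class with one annotated and one plain field. */
final class SketchApp {
    @SketchStat
    SketchCounter invalidOrders = new SketchCounter("Invalid Orders");

    SketchCounter unannotated = new SketchCounter("Not Registered");
}

final class StatDiscoverySketch {
    /** Scans the container's fields once and returns the annotated stat instances. */
    static List<Object> discover(Object container) {
        List<Object> found = new ArrayList<>();
        for (Field field : container.getClass().getDeclaredFields()) {
            if (!field.isAnnotationPresent(SketchStat.class)) {
                continue;
            }
            try {
                field.setAccessible(true);
                Object stat = field.get(container);
                if (stat != null) {
                    found.add(stat);
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        SketchApp app = new SketchApp();
        List<Object> registered = discover(app);
        app.invalidOrders = new SketchCounter("Invalid Orders"); // replaced after discovery
        System.out.println(registered.size());                      // 1
        System.out.println(registered.get(0) == app.invalidOrders); // false
    }
}
```

Declaring stat fields final, as the examples in this article do, rules out this kind of accidental replacement.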
Any @AppStat annotated field in the main application class will be discovered by the Talon server. If additional classes in your application contain user-defined stats, they can be exposed to the server using the AppStatContainersAccessor annotation:

    @AppHAPolicy(HAPolicy.EventSourcing)
    public static class MyApp {
        MyOtherClass someOtherClass = new MyOtherClass();

        // Adds further objects whose @AppStat fields should be discovered.
        @AppStatContainersAccessor
        public void getStatContainers(Set<Object> containers) {
            containers.add(someOtherClass);
        }
    }

    private static class MyOtherClass {
        @AppStat
        Counter numHeartbeats = StatsFactory.createCounterStat("Heartbeats Received");

        MyOtherClass() {
        }
    }

For Topic Oriented Applications, any @Managed object will be introspected for user-defined stats. See ManagedObjectLocator. The DefaultManagedObjectLocator for Hornet calls TopicOrientedApplication.addAppStatContainers(Set), so unless your application provides its own managed object locator, additional user-defined stats containers can be added by overriding addAppStatContainers:

    @AppHAPolicy(HAPolicy.EventSourcing)
    public static class MyApp extends TopicOrientedApplication {
        MyOtherClass someOtherClass = new MyOtherClass();

        @Override
        public void addAppStatContainers(Set<Object> containers) {
            containers.add(someOtherClass);
        }
    }

    private static class MyOtherClass {
        @AppStat
        Counter numHeartbeats = StatsFactory.createCounterStat("Heartbeats Received");

        MyOtherClass() {
        }
    }

When running in a Talon Server, the server registers discovered App Stats with the AepEngine. When not running in a Talon Server, user defined stats may be registered programmatically with the AepEngine by calling the appropriate register method:
Type | Registration Method |
---|---|
Counter | registerCounterStat(IStats.Counter counter) |
Gauge | registerGaugeStat(IStats.Gauge gauge) |
Series | registerSeriesStat(IStats.Series series) |
If not registered with the engine, app stats will not be collected with other engine stats when engine stats are enabled.
Registration of User Defined stats is only supported prior to engine startup. |