Overview

An operational XVM continuously collects raw statistics. The XVM can also be configured to spin up a background thread that periodically performs the following:

  1. Performs higher level statistical computations such as calculating message rates and average latencies.
  2. Emits heartbeat messages to be processed by handlers.
  3. Optionally outputs rendered stats to a trace logger, which is useful in testing and diagnostic situations.
  4. Optionally writes heartbeat messages containing useful server-wide statistics to a binary transaction log (with zero steady-state allocations), which is useful for zero garbage capture of performance data in production.

The raw metrics collected by the server are used by the background statistical thread for its computations and can also be retrieved programmatically by an application for its own use.
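
As a purely conceptual illustration of this background loop (the class and method names below are hypothetical and this is not the XVM's actual implementation), the collection cycle can be pictured as a scheduled task:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Conceptual model only: the real stats thread is internal to the XVM.
public class ConceptualStatsRunner {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(int intervalSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            // 1. Compute higher level statistics (rates, average latencies) from raw counters.
            computeRatesAndLatencies();
            // 2. Build and emit a heartbeat message for handlers and admin clients.
            emitHeartbeat();
            // 3. Optionally render stats to a trace logger (testing / diagnostics).
            traceIfEnabled();
            // 4. Optionally append the heartbeat to a binary transaction log (zero garbage capture).
            logToBinaryLogIfEnabled();
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void computeRatesAndLatencies() { /* hypothetical */ }
    private void emitHeartbeat()            { /* hypothetical */ }
    private void traceIfEnabled()           { /* hypothetical */ }
    private void logToBinaryLogIfEnabled()  { /* hypothetical */ }
}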

This document describes how to enable XVM heartbeats, configure which statistics are collected, and consume, trace, or log the heartbeats that an XVM emits.

Enabling Heartbeats

Heartbeats for an XVM can be enabled via DDL XML using the <heartbeats> element: 

<servers>
  <server name="my-xvm">
    <heartbeats enabled="true" interval="5">
      <collectNonZGStats>true</collectNonZGStats>
      <collectIndividualThreadStats>true</collectIndividualThreadStats>
      <collectPoolStats>true</collectPoolStats>
      <poolDepletionThreshold>1.0</poolDepletionThreshold>
      <collectSeriesStats>true</collectSeriesStats>
      <collectSeriesDatapoints>false</collectSeriesDatapoints>
      <maxTrackableSeriesValue>100000000</maxTrackableSeriesValue>
      <logging enabled="true"></logging>
      <tracing enabled="true"></tracing>
    </heartbeats>
  </server>
</servers>
The <heartbeats> element supports the following configuration settings (defaults in parentheses):

enabled (default: false)
Enable or disable server stats collection and heartbeat emission.
Collection of stats and emission of heartbeats can impact application performance from both a latency and a throughput standpoint. For applications that are particularly sensitive to performance, it is a good idea to compare performance with and without heartbeats enabled to understand the overhead incurred by enabling them.

interval (default: 1000)
The interval, in seconds, at which server stats are collected and emitted.

collectNonZGStats (default: true)
Some statistics collected by the stats collection thread require creating a small amount of garbage. Set this to false to suppress collection of those stats.

collectIndividualThreadStats (default: true)
Indicates whether heartbeats will contain stats for each active thread in the JVM. Individual thread stats are useful, for example, for spotting busy threads that may be candidates for affinitization (see Thread Stats below).

collectSeriesStats (default: true)
Indicates whether series stats should be included in heartbeats.

collectSeriesDatapoints (default: false)
Indicates whether series stats should report the individual data points captured for a series statistic.
Warning: Enabling this includes every datapoint collected for a series in heartbeats, which can make emitted heartbeats very large and slow their collection. It is not recommended in production.

maxTrackableSeriesValue (default: 10 minutes)
The maximum value (in microseconds) that can be tracked for reported series histogram timings. Datapoints above this value are downsampled to this value, but are still reflected in the max value reported for an interval.

collectPoolStats (default: true)
Indicates whether pool stats are collected by the server when nv.server.stats.enable=true.

poolDepletionThreshold (default: 1.0)
The percentage decrement by which a preallocated pool must drop for the pool to be included in a server heartbeat. Setting this to a value greater than 100, or less than or equal to 0, disables depletion threshold reporting.
This gives monitoring applications advance warning when it looks like a preallocated pool may soon be exhausted. By default the depletion threshold triggers inclusion in heartbeats at every 1% depletion of the preallocated count. It can also be changed by specifying the configuration property nv.server.stats.pool.depletionThreshold as a float value between 0 and 100.
For example, if a pool is preallocated with 1000 items and this property is set to 10, pool stats are emitted for the pool each time a heartbeat occurs and the pool has dropped below another 10% threshold of the preallocated size (e.g. at 900, 800, 700, ...) until its size reaches 0, at which point subsequent misses cause it to be included in every heartbeat. A worked example of this arithmetic follows this table.

logging (default: false)
Configures binary logging of heartbeats.
Binary heartbeat logging provides a means by which heartbeat data can be captured in a zero garbage fashion. Collection of such heartbeats can be useful in diagnosing performance issues in running apps.

tracing (default: false)
Configures trace logging of heartbeats.
Trace logging renders heartbeats in human-readable form to a trace logger, which is useful in testing and diagnostic situations but, unlike binary logging, is not zero garbage.
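
To make the poolDepletionThreshold arithmetic concrete, the sketch below models the behavior described above for the 1000-item / 10% example. It is an illustration of the documented behavior only (the class and method names are hypothetical), not the XVM's internal implementation.

// Illustrative only: models the depletion-threshold behavior described above.
public final class DepletionThresholdExample {

    /**
     * Returns true if the pool's size has crossed a new depletion threshold
     * since the last heartbeat in which the pool was reported.
     *
     * preallocated      the number of items preallocated for the pool (PRE)
     * currentSize       the pool's current size (SIZE)
     * lastReportedSize  the pool size at the last heartbeat in which it was reported
     * thresholdPercent  the configured poolDepletionThreshold (0 < value <= 100)
     */
    static boolean shouldReport(int preallocated, int currentSize, int lastReportedSize, double thresholdPercent) {
        if (thresholdPercent <= 0 || thresholdPercent > 100) {
            return false; // depletion threshold reporting disabled
        }
        double step = preallocated * (thresholdPercent / 100.0); // e.g. 1000 * 10% = 100 items
        // Which depletion "band" is each size in? Crossing into a lower band triggers reporting.
        int lastBand = (int) ((preallocated - lastReportedSize) / step);
        int currentBand = (int) ((preallocated - currentSize) / step);
        return currentBand > lastBand;
    }

    public static void main(String[] args) {
        // Pool preallocated with 1000 items, threshold of 10%: report at 900, 800, 700, ...
        System.out.println(shouldReport(1000, 920, 1000, 10.0)); // false: hasn't dropped below 900 yet
        System.out.println(shouldReport(1000, 897, 1000, 10.0)); // true:  dropped below the 900 threshold
        System.out.println(shouldReport(1000, 880, 897, 10.0));  // false: still within the same 10% band
    }
}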

Enabling Global Stats

An XVM collects the stats that are enabled for the applications it contains. The following stats can be enabled and reported in heartbeats.

Environment Properties

<env>
  <!-- global stats properties -->
  <nv>
    <stats.latencymanager.samplesize>65536</stats.latencymanager.samplesize>
    <msg.latency.stats>true</msg.latency.stats>
    <ods.latency.stats>true</ods.latency.stats>
    <link.network.stampiots>true</link.network.stampiots>
  </nv>
</env>
The following environment properties control global stats collection (set them in the <env> section as shown above):

nv.stats.latencymanager.samplesize
The global default sample size used for capturing latencies. Latency stats are collected in a ring buffer that is sampled by the stats thread at each collection interval.
Tip: This should be sized large enough that datapoints aren't missed, but not so large that it adversely affects processor cache performance.

nv.msg.latency.stats
This global property instructs the platform to collect latency statistics for messages passing through various points in the process pipeline.
Tip: When this is not enabled, many latency stats will not be available in heartbeats.

nv.ods.latency.stats
Globally enables collection of application store latencies.

nv.link.network.stampiots
Instructs low level socket I/O to stamp input/output times on written data.
Tip: This should be enabled to capture store latencies based on wire time, or for certain latencies in the direct message bus binding.
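
The DDL <env> section shown above is the documented way to set these properties. If your deployment also honors JVM system properties for nv.* settings (an assumption here, not something this page states), the equivalent could be supplied with -D flags or programmatically before the platform initializes, for example:

// Assumption: nv.* settings can also be supplied as JVM system properties
// (e.g. -Dnv.msg.latency.stats=true on the command line). If your deployment
// only honors the DDL <env> section, configure them there instead.
public class EnableLatencyStats {
    public static void main(String[] args) {
        System.setProperty("nv.stats.latencymanager.samplesize", "65536");
        System.setProperty("nv.msg.latency.stats", "true");
        System.setProperty("nv.ods.latency.stats", "true");
        System.setProperty("nv.link.network.stampiots", "true");
        // ... then bootstrap the XVM / application as usual.
    }
}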

Engine Stats

The XVM statistics thread collects the following latency stats from the apps it contains when they are enabled:

<apps>
  <app name="my-app" ...>
    <captureTransactionLatencyStats>true</captureTransactionLatencyStats>
    <captureEventLatencyStats>true</captureEventLatencyStats>
    <captureMessageTypeStats>false</captureMessageTypeStats>
  </app>
</apps>

Warning: Per-message-type latency stats are not currently emitted in heartbeats due to the size they would add to heartbeats. Instead, they can be traced using the AEP engine stats thread.


Handling XVM Heartbeats

When heartbeats are enabled, they can be consumed and emitted in several ways, as discussed below.

Tracing Heartbeats

By default all server statistics tracers are disabled, as trace logging is not zero garbage and introduces CPU overhead in computing statistics. While tracing heartbeats isn't recommended in production, enabling server statistics trace output can be useful for debugging and performance tuning. To enable it, you will need to configure the appropriate tracers at the debug level. See the Output Trace Loggers section for more detail.

<servers>
  <server name="my-xvm">
    <heartbeats enabled="true" interval="5">
      <logging enabled="true"></logging>
      <tracing enabled="true"></tracing>
    </heartbeats>
  </server>
</servers>

Binary Heartbeat Logging

Applications that are latency sensitive might prefer to leave all tracers disabled to avoid unnecessary allocations and the associated GC activity. As an alternative, it's possible to enable logging of zero-garbage heartbeat messages to a binary transaction log:

<servers>
  <server name="my-xvm">
    <heartbeats enabled="true" interval="5">
      <logging enabled="true">
         <storeRoot>/path/to/heartbeat/log/directory</storeRoot>
      </logging>
    </heartbeats>
  </server>
</servers>

Heartbeat logs can be queried and traced out of process using the Stats Dump Tool.

Warning: Note that at this time binary heartbeat logs do not support rolling collection. Consequently, this mechanism is not suitable for long running application instances.


Heartbeat Event Handlers

Your application can register an event handler for server heartbeats to handle them in process. 

@EventHandler
public void onHeartbeat(SrvMonHeartbeatMessage message) {
   // Your logic here:
   // - You could emit over an SMA message bus.
   // - log to a time series database.
   // etc, etc.
}

See the SrvMonHeartbeatMessage JavaDoc for API details. 

Admin Clients

A Lumino or Robin controller can also be used to connect to a server via a direct admin connection over TCP and listen for heartbeats for monitoring purposes. The XVM's stats thread queues a copy of each emitted heartbeat to each connected admin client.

Heartbeat Output

When heartbeat tracing is enabled, heartbeat trace is emitted to the nv.server.heartbeat logger at the info level. This section discusses the output format of heartbeats.


System Stats

System stats appear in trace output when nv.server.stats.enable=true and nv.server.stats.sys.trace=debug:

<servers>
  <server name="my-xvm">
    <heartbeats enabled="true" interval="5">
      <tracing enabled="true">
        <traceSysStats>true</traceSysStats>
      </tracing>
    </heartbeats>
  </server>
</servers>
[System Stats]
Tue Jun 28 17:26:03 PDT 2016 'ems1' server (pid=50360) 2 apps (collection time=73982839 ns)
System: 20 processors load average: 2.30
Memory: HEAP 8.0G init 6.6G used 7.8G commit 7.8G max NON-HEAP 2M init 44M used 46M commit 0K max
Threads: 26 total (23 daemon) 27 peak
JIT: HotSpot 64-Bit Tiered Compilers, time: 11997 ms
[6 6404 ms] [2 8771 ms]

The System Stats output includes the following groups of information:

General Info
System Info
Memory Info
Thread Info
JIT Info
GC Info

Thread Stats

Individual thread stats can be traced by setting the following in DDL:

<servers>
  <server name="my-xvm">
    <heartbeats enabled="true" interval="5">
      <collectIndividualThreadStats>true</collectIndividualThreadStats>
      <tracing enabled="true">
        <traceThreadStats>true</traceThreadStats>
      </tracing>
    </heartbeats>
  </server>
</servers>

When enabled, the following stats are traced to the console.

ID    CPU       DCPU    DUSER   CPU%  USER% STATE           NAME
2     4.6ms     0       0       0     0     WAITING         Reference Handler
3     4.8ms     0       0       0     0     WAITING         Finalizer
4     31.9us    0       0       0     0     RUNNABLE        Signal Dispatcher
5     37.8ms    85.6us  85.6us  1     100   TIMED_WAITING   JFR request timer
7     635.5ms   3.1ms   3.1ms   1     100   RUNNABLE        VM JFR Buffer Thread
25    90.7ms    365.3us 365.3us 1     100   RUNNABLE        X-EDP-McastReceiver (aff=[1(s0c1t0)])
26    37.8ms    142.7us 142.7us 1     100   TIMED_WAITING   X-EDP-Timer (aff=[1(s0c1t0)])
28    1.1s      2.5ms   2.5ms   1     101   RUNNABLE        X-Server-my-xvm-StatsRunner (aff=[1(s0c1t0)])
29    672.7ms   433.8us 433.8us 1     100   RUNNABLE        X-Server-my-xvm-Main (aff=[])
31    75.5ms    451.8us 451.8us 1     100   TIMED_WAITING   Wrapper-Control-Event-Monitor
33    33.7ms    110.3us 110.3us 1     100   RUNNABLE        Wrapper-Connection
34    1.2s      0       0       0     0     RUNNABLE        DestroyJavaVM
45    1.1s      0       0       0     0     WAITING         X-STEMux-admin-1 (aff=[])
46    14.5m     5.2s    5.2s    100   100   RUNNABLE        X-STEMux-myapp-2 (aff=[6(s0c9t0)])
47    14.5m     5.2s    5.2s    100   101   RUNNABLE        X-AEP-BusManager-IO-driver.myapp (aff=[4(s0c4t0)])
48    750.5ms   406.3us 406.3us 1     100   RUNNABLE        Thread-22
49    4.7s      331.1us 331.1us 1     100   RUNNABLE        Thread-23

Where the columns can be interpreted as follows:

ID: The thread's id.
CPU: The total amount of time, in nanoseconds, that the thread has executed (as reported by the JMX thread bean).
DCPU: The amount of time, in nanoseconds, that the thread executed in user or system mode during the interval (as reported by the JMX thread bean).
DUSER: The amount of time, in nanoseconds, that the thread executed in user mode during the interval (as reported by the JMX thread bean).
CPU%: The percentage of CPU time the thread used during the interval (i.e. DCPU * 100 / interval time).
USER%: The percentage of user mode CPU time the thread used during the interval (i.e. DUSER * 100 / interval time).
STATE: The thread's runnable state at the time of collection.
NAME: The thread name. Note that when affinitization is enabled and the thread has been affinitized, the affinitization information is appended to the thread name.
Tip: This is useful when trying to determine whether a thread should be affinitized. A busy-spinning thread will typically have a CPU% of ~100; if such a thread is not affinitized, it may be a good candidate.
Tip: The affinity summary string reported along with individual thread stats is not given a column of its own because the affinitizer appends it to the thread name.
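
Because these figures come from the JMX thread bean, the deltas and percentages can be reproduced independently of the XVM. The sketch below is a standalone illustration of the same calculation using the standard ThreadMXBean API; it is not the XVM's internal code.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Illustrative only: derives DCPU/DUSER/CPU%/USER% for a single thread as
// described in the table above, using the standard JMX thread bean.
// Assumes thread CPU time measurement is supported and enabled on this JVM.
public class ThreadCpuSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();
        long intervalNanos = 1_000_000_000L; // 1 second sampling interval

        long cpuStart = threads.getThreadCpuTime(tid);   // total CPU time so far (CPU), in ns
        long userStart = threads.getThreadUserTime(tid); // total user-mode time so far, in ns

        Thread.sleep(intervalNanos / 1_000_000L);

        long dcpu = threads.getThreadCpuTime(tid) - cpuStart;    // DCPU
        long duser = threads.getThreadUserTime(tid) - userStart; // DUSER
        double cpuPct = dcpu * 100.0 / intervalNanos;             // CPU%  = DCPU  * 100 / interval
        double userPct = duser * 100.0 / intervalNanos;           // USER% = DUSER * 100 / interval

        System.out.printf("DCPU=%dns DUSER=%dns CPU%%=%.1f USER%%=%.1f%n", dcpu, duser, cpuPct, userPct);
    }
}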

CPU times are reported using the most appropriate short form:

Days: d
Hours: h
Minutes: m
Seconds: s
Milliseconds: ms
Microseconds: us
Nanoseconds: ns
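
For illustration, a simple formatter that picks the short form from the table above might look like this (a sketch, not the XVM's formatting code):

// Illustrative sketch: formats a nanosecond duration using the short forms above.
public class CpuTimeFormat {
    static String shortForm(long nanos) {
        if (nanos >= 86_400_000_000_000L) return fmt(nanos / 86_400_000_000_000.0) + "d";
        if (nanos >= 3_600_000_000_000L)  return fmt(nanos / 3_600_000_000_000.0) + "h";
        if (nanos >= 60_000_000_000L)     return fmt(nanos / 60_000_000_000.0) + "m";
        if (nanos >= 1_000_000_000L)      return fmt(nanos / 1_000_000_000.0) + "s";
        if (nanos >= 1_000_000L)          return fmt(nanos / 1_000_000.0) + "ms";
        if (nanos >= 1_000L)              return fmt(nanos / 1_000.0) + "us";
        return nanos + "ns";
    }

    static String fmt(double v) { return String.format("%.1f", v); }

    public static void main(String[] args) {
        System.out.println(shortForm(37_800_000L));      // "37.8ms" (cf. the sample output above)
        System.out.println(shortForm(870_000_000_000L)); // "14.5m"  (cf. the sample output above)
    }
}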

Pool Stats

Pool stats can be traced by setting the following in DDL:

<servers>
  <server name="my-xvm">
    <heartbeats enabled="true" interval="5">
      <collectPoolStats>true</collectPoolStats>
      <tracing enabled="true">
        <tracePoolStats>true</tracePoolStats>
      </tracing>
    </heartbeats>
  </server>
</servers>

To reduce the size of heartbeats, pool stats for a given pool are only included when there has been noteworthy activity for that pool during the interval, for example when the pool has experienced misses or has dropped below its next poolDepletionThreshold level since it was last reported.

Trace Output

[Pool Stats]
PUT   DPUT  GET   DGET  HIT   DHIT  MISS  DMISS GROW  DGROW EVIC  DEVIC DWSH  DDWSH SIZE  PRE   CAP   NAME
38    0     16.8M 0     38    0     16.8M 0     0     0     0     0     0     0     0     0     1024  iobuf.native-32.20
1     0     62    0     1     0     61    0     0     0     0     0     0     0     0     0     1024  iobuf.native-64.21
1     0     1.0M  0     1     0     1.0M  0     0     0     0     0     0     0     0     0     1024  iobuf.native-256.23
7     0     75    0     7     0     68    0     0     0     0     0     0     0     0     0     1024  iobuf.heap-32.1

Where the stats can be interpreted as follows:

PUT: The overall number of times items were put (returned) to the pool.
DPUT: The number of times items were put (returned) to the pool since the last time the pool was reported in a heartbeat (the delta).
GET: The overall number of times an item was taken from the pool.
Tip: If pool items are not being leaked, GET - PUT indicates the number of items that have been taken from the pool and not returned, e.g. items that are being held by messages in the transaction processing pipeline or by application state.
DGET: The number of times an item was taken from the pool since the last time the pool was reported in a heartbeat (the delta).
HIT: The overall number of times a take from the pool was satisfied by an available item in the pool.
DHIT: The number of times a take from the pool was satisfied by an available item in the pool since the last time the pool was reported in a heartbeat (the delta).
MISS: The overall number of times a take from the pool was not satisfied by an available item, resulting in an allocation.
DMISS: The number of times a take from the pool was not satisfied by an available item, resulting in an allocation, since the last time the pool was reported in a heartbeat.
GROW: The overall number of times the capacity of the pool had to be increased to accommodate returned items.
DGROW: The number of times the capacity of the pool had to be increased to accommodate returned items since the last time the pool was reported in a heartbeat.
EVIC: The overall number of items that were evicted from the pool because the pool did not have adequate capacity to store them.
DEVIC: The number of items that were evicted from the pool because the pool did not have adequate capacity to store them since the last time the pool was reported in a heartbeat.
DWSH: The overall number of times that an item returned to the pool was washed (i.e. fields reset) in the detached pool washer thread.
DDWSH: The number of times that an item returned to the pool was washed (i.e. fields reset) in the detached pool washer thread since the last time the pool was reported in a heartbeat.
SIZE: The number of items currently in the pool and available for pool gets. This number will be 0 if all objects that have been allocated by the pool have been taken.
Tip: Because pool stats are generally printed when there are pool misses, this value will often be 0, reflecting that there are no items available in the pool.
PRE: The number of items initially preallocated for the pool.
CAP: The capacity of the backing array allocated to hold available pool items that have been preallocated or returned to the pool.
Tip: The capacity of a pool grows automatically as items are returned to the pool without being taken out. A large capacity generally indicates that at some point in the past a larger number of items was needed, but is not currently being used.
NAME: The unique identifier for the pool.
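
As the GET/PUT tip above suggests, a monitoring application can derive useful figures from these counters. The sketch below is illustrative only, using the (rounded) values from the iobuf.native-32.20 row of the sample output above:

// Illustrative only: derived figures a monitoring application might compute
// from the pool counters described above.
public class PoolStatsMath {
    public static void main(String[] args) {
        // Rounded values from the iobuf.native-32.20 row of the sample trace output.
        long put = 38, get = 16_800_000, hit = 38, miss = 16_800_000;

        long outstanding = get - put;       // items taken and not yet returned (assuming no leaks)
        double hitRate = hit * 100.0 / get; // % of gets satisfied from the pool
        double missRate = miss * 100.0 / get; // % of gets that caused an allocation

        System.out.printf("outstanding=%d hitRate=%.6f%% missRate=%.6f%%%n",
                outstanding, hitRate, missRate);
    }
}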

User Defined App Stats

Stats that you define in your application are also emitted in heartbeats; they can be included in trace output with the following configuration:

<servers>
  <server name="my-xvm">
    <heartbeats enabled="true" interval="5">
      <tracing enabled="true">
        <traceAppStats>true</traceAppStats>
      </tracing>
    </heartbeats>
  </server>
</servers>

Trace Output

[App (ems) User Stats]
...Gauges{
......EMS Messages Received: 142604
......EMS Orders Received: 35651
...}
...Series{
......[In Proc Tick To Trade(sno=35651, #points=150, #skipped=0)
.........In Proc Tick To Trade(interval): [sample=150, min=72 max=84 mean=75 median=75 75%ile=77 90%ile=79 99%ile=83 99.9%ile=84 99.99%ile=84]
.........In Proc Tick To Trade (running): [sample=35651, min=72 max=2000 mean=93 median=76 75%ile=82 90%ile=111 99%ile=227 99.9%ile=805 99.99%ile=1197]
......[In Proc Time To First Slice(sno=35651, #points=150, #skipped=0)
.........In Proc Time To First Slice(interval): [sample=150, min=85 max=98 mean=88 median=88 75%ile=90 90%ile=92 99%ile=95 99.9%ile=98 99.99%ile=98]
.........In Proc Time To First Slice (running): [sample=35651, min=84 max=4469 mean=249 median=88 75%ile=95 90%ile=133 99%ile=283 99.9%ile=3628 99.99%ile=4143]
...}
