where scala is beloved

# Using T-Digest: Median calculation and anomaly detection

In this article you can find information about using `t-digest` library in order to measure average value of some quantity(average session time). There is also an answer for the question: What and why should you use to make such the measurement mean or median? Besides, list and comparison of different implementations is presented below in the article.

## Main problem

So in what cases we have need to calculate mean/median? For example we have a site and we want to understand how much time an average user spent on our site. In order to do it we should calculate an average duration of a user session. And there are at least two ways to do it - calculate an arithmetic mean(or just mean) or calculate a median. The calculation of mean is very simple. You need two fields: one for a sum of elements and another for their count. But it doesn`t work very well with anomalies in the data. I mean the case when one or several elements differ greatly from others. Lets assume that we have such values for our session durations(in milliseconds):

``````3000, 2000, 3000, 5000, 3000, 4000, 4500, 3200, 2700, 3380
``````

mean = `(3000+2000+3000+5000+3000+4000+4500+3200+2700+3380) / 10 = 3378 msecs`. In this case all is ok.

But what if one of these users opens the site, forgets to close a browser tab and goes afk for an hour(3.600.000 msecs):

``````3000, 2000, 3000, 5000, 3000, 4000, 4500, 3200, 2700, 3600000
``````

mean = `(3000+2000+3000+5000+3000+4000+4500+3200+2700+3600000) / 10 = 363040 msecs`. Just one of the users influences strongly on mean value. Generally speaking, the mean is only representative if the distribution of the data is symmetric, otherwise it may be heavily influenced by outlying measurements. In simple cases it is possible to use some kind of a filter. But often we just don’t know what threshold we should use to filter values. Whereas the median value is the same in both cases and is equal to `3100`. So in the cases like this the median will be more useful then the mean. However the calculation of the median in general case needs a lot of memory - O(n)

## T-digest

In order to deal with memory consumption `Q-Digest` algorithm has been developed and after that `T-digest` has been created as improvement of `Q-digest`. The main idea of these algorithms is to sacrifice accuracy of calculation for decrease of required amount of memory.

## List of implementations

I’ve found these libs for median calculation:

`apache/commons-math` implements a regular algorithm for median calculation. It is used for the estimation of calculation accuracy of viewed `t-digest` implementations.

## tdunning/t-digest: simple example

A short description from github page:

``````A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.
``````

Friendliness for a parallel calculation shows itself by support of a merging operation. You can see it in the example below.

Example of the code for t-digest:

``````object TDigestExample {
import com.tdunning.math.stats.TDigest

def main(args: Array[String]) {
//define constants for experiments
val oneSecond = 1000
val twoMinutes = 2 * 60 * 1000
val tenMinutes = 10 * 60 * 1000
val twoHours = 2 * 60 * 60 * 1000

val mainValues = 10000000

//generate 10.000.000 pseudo-random values for normal user session durations
val mainData = Generator.generate(count = mainValues, from = oneSecond, to = twoMinutes)

//generate 100.000(1%) pseudo-random values for invalid user session durations
val badData = Generator.generate(count = badValues, from = tenMinutes, to = twoHours)

//generate one united collection. For further details see below
val totalData = {
Generator.generate(count = mainValues, from = oneSecond, to = twoMinutes) ++
Generator.generate(count = badValues, from = tenMinutes, to = twoHours)
}

// Experiment #1
// All data in one digest object
// all values from one collection(totalData) are added to one digest object

// recommend default value of compression = 100
val totalDigest = TDigest.createArrayDigest(100)

//the timing of the median calculation
val startTime = System.currentTimeMillis()
val median = totalDigest.quantile(0.5d)               //the median is a 0.5 quantile
val calcTime = System.currentTimeMillis() - startTime
println("calcTime: " + calcTime)

// Experiment #2
// Separate objects for normal(mainData) and bad data(badData) to check accuracy of t-digest merging
val mainDigest = TDigest.createArrayDigest(100)

//to check accuracy of merging several digests
val mergedDigest = TDigest.createArrayDigest(100)

val mergedMedian = mergedDigest.quantile(0.5d)
println("median:       " + median)
println("mergedMedian: " + mergedMedian)

//how many bytes is needed to serialize t-digest object
println("byte size:    " + totalDigest.byteSize())

//how much memory is needed for totalDigest object
val sizeInMem = SizeEstimator.estimate(totalDigest)
println(s"size in mem:  \$sizeInMem bytes")
}
}
``````

In general there is no much difference in the interface between `tdunning/t-digest` and `addthis/stream-lib`. The example for `addthis/stream-lib` you can find in `StreamLibExample.scala` file. However `tdunning/t-digest` has more accuracy and less calculation time. More details about lib comparison are below.

## Comparison

Now lets compare our libraries. The libraries are compared by:

• Time of calculation
• Calculation accuracy
• Serialization size
• Needed memory

Results:

Median Value Calc.Time(msec) Calc. Acc.(%) Serial. size(bytes) Memory(bytes)
commons-math 61084,0000 5.922 100,000000 not-supported 134.218.824
t-digest 61084,6668 8.234 100,001090 15740 29.688
stream-lib 60763,2110 20.142 99,474839 24820 231.784

## Anomaly detection

One more interesting application of the median is anomaly detection. In order to find anomalies in your data sequence threshold should be calculated. For this you need to get 99.9%-quantile which means that we expect ~0.1% of anomalies in the data sequence. Look below to see how to do it using `tdunning/t-digest`.

``````/**
* Example of anomaly detection
*/
object AnomalyDetectionExample {
import com.tdunning.math.stats.TDigest
def main(args: Array[String]) {
//define constants for experiments
val oneSecond = 1000
val twoMinutes = 2 * 60 * 1000
val tenMinutes = 10 * 60 * 1000
val twoHours = 2 * 60 * 60 * 1000

val mainValues = 10000000

//generate 10.000.000 pseudo-random values for normal user session durations
val mainData = Generator.generate(count = mainValues, from = oneSecond, to = twoMinutes)

//generate 100.000(1%) pseudo-random values for invalid user session durations
val badData = Generator.generate(count = badValues, from = tenMinutes, to = twoHours)

val totalData = mainData ++ badData

val totalDigest = TDigest.createArrayDigest(100)