where scala is beloved

Comparison of HLL implementations

In this article we will look at HLL algorithm and different implementations of it.

General Info

HLL is a propabalistic algorithm which is used for a estimation of unique values. More details about HLL you can get here. The main reason to use HLL is necessity to estimate uniques in very big amount of data in case if it is possible to sacrifice accuracy of an unique counter.

List of implementations

You can find several implementations of HLL:

Next in this article we will take a close look at all these libs and answer the question: «Why should we use HLL?».

Example of use

So why should we might use HLL in our programs? The main reason is reducing of data which we have to store.

For example you need to count the number of unique visitors for every page on a site. You can use a structure like this:

case class PageVisitors(pageUrl: String, visitors: Set[UUID])

And it will work until you have a lot of pages and very many visitors on each page. For instance let`s assume that we have an analytic system which work with visit statistic of different sites. Suppose we have 1000 pretty popular sites and because sites are popular each site have about 100000 unique visitors per day. We also want to save statistic of unique visitors per day for each site at least one year. We mark visitors using UUID - 128 bits identifier.

one UUID = 128 bits = 16 bytes per unique visitor

# for one site per day
100000 unique visitors * 16 bytes = 1600000 bytes ~= 1.52 Mbyte of data

# for one site per year
1.52 Mbyte * 365 days = 556 Mbyte

# total amount of data for 1000 sites
556 Mbyte * 1000 = 556 Gbyte of total data

But if we are ready to sacrifice accuracy a little bit we will considerably reduce amount of stored data. For example if we use prasanthj/hyperloglog to store unique counter 16Kbytes will be enough to store 10.000.000 unique values only with ~0.52% of error.

16Kbyte per site * 365 days * 1000 sites = 5.5Gbyte of total data

We will reduce the needed size of data a hundredfold if we agree with 0.52% of error.

This library contains various algorithms like HyperLogLog, CountMinSketch and others. The full list you can find here.

Below is an example of using algebird.HyperLogLogMonoid:

object SimpleAlgebirdExample {

def main(args: Array[String]) {
//define test data
val data = Seq("aaa", "bbb", "ccc")
//create algebird HLL
val hll = new HyperLogLogMonoid(bits = 10)
//convert data elements to a seq of hlls
val hlls = data.map { str =>
val bytes = str.getBytes("utf-8")
hll.create(bytes)
}

//merge seq of hlls in one hll object
val merged = hll.sum(hlls)

//WARN: don`t use merged.size - it is a different thing
//get the estimate count from merged hll
println("estimate count: " + hll.sizeOf(merged).estimate)
//or
println("estimate count: " + merged.approximateSize.estimate)
}
}

prasanthj/hyperloglog

It is a java library containing only an implementation of HyperLogLog. There is no a maven artefact for this library, so you can build it manually or use jitpack.io as it was described at the end of this article.

Also it contains a command line test tool which can help to choose settings for HLL and to see how accurate estimation will be. In order to use it you should build the library locally using maven and execute hll script.

# clone repo
git clone https://github.com/prasanthj/hyperloglog.git hyperloglog
cd hyperloglog

# build the library
mvn package -DskipTests

# execute tests
./hll -n 10000000 -s -o ./out.hll

You should see something like that:

Actual count: 10000000
Encoding: DENSE, p : 14, chosenHashBits: 128, estimatedCardinality: 10052011
Relative error: -0.52011013%
Serialized hyperloglog to ./out.hll
Serialized size: 10248 bytes
Serialization time: 13 ms

You can find out more details about params of hll on the github page of the library.

The example of how to use it is below:

object SimplePrasanthjHllExample {
import hyperloglog.HyperLogLog.HyperLogLogBuilder
def main(args: Array[String]) {
//define test data
val data = Seq("aaa", "bbb", "ccc")

//create a builder for HLL.
val hllBuilder = new HyperLogLogBuilder()
// You can set different parameters for it using
// hllBuilder.setEncoding(...)
// hllBuilder.setNumHashBits(...)
// hllBuilder.setNumRegisterIndexBits(...)
// hllBuilder.enableBitPacking(...)
// hllBuilder.enableNoBias(...)

//create hll object in which we will merge our data
val mergedHll = hllBuilder.build()

//merge data
data.foreach { elem =>
//explicitly set using encoding
val bytes = elem.getBytes("utf-8")
}

//print the estimation of count
println("estimate count: " + mergedHll.count())
}
}

It is another java library where you can find lots of algorithms. But I have not found any documentation for it. So it can be a problem. If you have a question how to use this library you can try to ask it in the mailing list here

The simple example is here:

object SimpleStreamExample {
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
def main(args: Array[String]) {
//define test data
val data = Seq("aaa", "bbb", "ccc")

//create HLL object in which we will add our data.
// You can set parameters here in a constructor
val merged = new HyperLogLogPlus(5, 25)

data.foreach{ elem =>
//in order to control string encoding during string conversion to bytes we explicitly set using encoding
val bytes = elem.getBytes("utf-8")
merged.offer(bytes)
}

//print the estimation
println("estimate count: " + merged.cardinality())
}
}

aggregateknowledge/java-hll

This is a low-level java implementation of a hll counter. The main principle is the same but you have to define and to use your own hash function(murmur3_128 in the example below)

Simple example:

object SimpleAgknHllExample {
import net.agkn.hll.HLL

private val seed = 123456
/** use murmur3 hash function from `com.google.common.hash`*/
private val hash = Hashing.murmur3_128(seed)

def main(args: Array[String]) {
//define test data
val data = Seq("aaa", "bbb", "ccc")

//create hll object in which we will merge our data with default values of params
val hll = new HLL(13, 5)

//add data to the hll counter

println("estimate count: " + hll.cardinality())
}

def toHash(str: String): Long = {
val hasher = hash.newHasher()
//As always we set encoding explicitly
hasher.putBytes(str.getBytes("utf-8"))
hasher.hash().asLong()
}
}

Bonus: Intersection of HLLs

It is possible to make intersections of HLL counters in order to find the number of common elements in them. Such logic is implemented in twitter/algebird here. The main underlying principle in this algorithm is «Inclusion–exclusion principle». But you should be aware about how intersection affect an estimation error. I have found this article which helps to understand possible issues.

In order to make possible computation of intersection for other libs I have also created small facade which can be found in the package io.koff.hll.facade. And below you can see the example of how to use it in order to find intersection of four hlls:

package io.koff.hll

/**
* A common algorithm to find intersection of several HLL counters
*/
object Intersection {
def main(args: Array[String]) {
//"444", "555" and "666" are common elements
def set1 = Seq("111", "222", "333", "444", "555", "666").toHLL
def set2 = Seq("222", "333", "444", "555", "666", "777").toHLL
def set3 = Seq("333", "444", "555", "666", "777", "888").toHLL
def set4 = Seq("444", "555", "666", "777", "888", "999").toHLL

val intersectionCount = HLLUtils.intersection(set1, set2, set3, set4)
println("intersection: " + intersectionCount) //will print "intersection: 3"
}
}

Comparison Results

I have also written a little test which compares described libraries. You can find it here. Results you can see below:

algebird	[error: 0,000758%,  calcTime: 8547 msecs, estimateCount: 999243,  dataSize: 65536 bytes]
prasanthj	[error: -0,003540%, calcTime: 643 msecs,  estimateCount: 1003541, dataSize: 10247 bytes]
stream		[error: -0,001181%, calcTime: 294 msecs,  estimateCount: 1001182, dataSize: 43702 bytes]
agkn-hll	[error: 0,003895%,  calcTime: 508 msecs,  estimateCount: 996106,  dataSize: 40963 bytes]

So as you can see:

• twitter/algebird has showed itself as quite a slow library. It can be a problem if you work with big data.
• prasanthj/hyperloglog always needs minimal space to store serialized data of HLL counter.
• addthis/stream-lib is the fastest library in those that we have looked at.
• aggregateknowledge/java-hll(last row) is average. Not too slow, but also not super-fast, average data size.