Streaming Algorithms — II — Counting Distinct Elements

Maneesh Chaturvedi
4 min readDec 22, 2021

Counting distinct elements is a problem that frequently arises in distributed systems. In general, the size of the set under consideration (which we will henceforth call the universe) is enormous. For example, if we build a system to identify denial of service attacks, the set could consist of all IP V4 and V6 addresses. Another common use case is to count the number of unique visitors on popular websites like Twitter or Facebook.

An obvious approach if the number of elements is not very large would be to maintain a Set. We can check if the set contains the element when a new element arrives. If not, we add the element to the set. The size of the set would give the number of distinct elements. However, if the number of elements is vast or we are maintaining counts for multiple streams, it would be infeasible to maintain the set in memory. Storing the data on disk would be an option if we are only interested in offline computation using batch processing frameworks like Map Reduce.

Like the previous algorithms we looked at, the algorithms for counting distinct elements are also approximate, with an error threshold that can be tweaked by changing the algorithm's parameters.

Flajolet — Martin Algorithm

The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after the algorithm's creators. The Flajolet-Martin algorithm is a single pass algorithm. If there are m distinct elements in a universe comprising of n elements, the algorithm runs in O(n) time and O(log(m)) space complexity. The following steps define the algorithm.

  • First, we pick a hash function h that takes stream elements as input and outputs a bit string. The length of the bit strings is large enough such that the result of the hash function is much larger than the size of the universe. We require at least log nbits if there are n elements in the universe.
  • r(a) is used to denote the number of trailing zeros in the binary representation of h(a) for an element a in the stream.
  • R denotes the maximum value of r seen in the stream so far.
  • The estimate of the number of distinct elements in the stream is (2^R).

To intuitively understand why the algorithm works consider the following.

The probability that h(a) ends in at least i zeros is exactly (2^-i). For example, for i=0, there is a probability 1that the tail has at least 0zeros. For i=1, there is a probability of 1/2 that the last bit is zero; for i=2, the probability is 1/4 that the last two bits are zero's, and so on. The probability of the rightmost set bit drops by a factor of 1/2 with every position from the Least Significant Bit to the Most Significant Bit.

This probability should become 0 when bit position R >> log m while it should be non-zero when R <= log m.Hence, if we find the rightmost unset bit position R such that the probability is 0, we can say that the number of unique elements will approximately be 2 ^ R.

The Flajolet-Martin uses a multiplicative hash function to transform the non-uniform set space to a uniform distribution. The general form of the hash function is

(ax + b) mod c where a and b are odd numbers and c is the length of the hash range.

The Flajolet-Martin algorithm is sensitive to the hash function used, and results vary widely based on the data set and the hash function. Hence there are better algorithms that utilize more than one hash function. These algorithms use the average and median values to reduce skew and increase the predictability of the result.

Flajolet-Martin Psuedocode and Explanation

  1. L = 64 (size of the bitset), B= bitset of size L
  2. hash_func = (ax + b) mod 2^L
  3. for each item x in stream
  • y = hash(x)
  • r = get_righmost_set_bit(y)
  • set_bit(B, r)

4. R = get_righmost_unset_bit(B)

5. return 2 ^ R

We define a hash range, big enough to hold the maximum number of possible unique values, something as big as 2 ^ 64. Every stream element is passed through a hash function that transforms the elements into a uniform distribution.

For each hash value, we find the position of the rightmost set bit and mark the corresponding position in the bitset to 1. Once all elements are processed, the bit vector will have 1s at all the positions corresponding to the position of every rightmost set bit for all elements in the stream.

Now we findR , the rightmost 0 in this bit vector. This position R corresponds to the rightmost set bit that we have not seen while processing the elements. This corresponds to the probability 0 and will help in approximating the cardinality of unique elements as 2 ^ R.

--

--

Maneesh Chaturvedi

Seasoned Software Professional, Author, Learner for Life