Data Streams — Models and Algorithms 1

Maneesh Chaturvedi
4 min readDec 14, 2021

Data streams have become ubiquitous in today's large distributed systems. These data streams could be network traffic or weblogs that need analysis to derive some metric. For example, looking at network traffic to determine if there is a denial of service attack or finding the most visited links on a website.

Over the next few posts, I will cover some of the most common types of stream-based algorithms along with their applications. This post will introduce streaming models and a few simple stream-based algorithms.

Streaming Model

A stream can be considered a massively long continuous sequence of elements. The size of the stream is denoted by m, and the size of the universe of all events is denoted by n. The element of a stream is modeled as a tuple. The simplest tuple is <element_id, frequency>. In the most basic case, if the frequency is not specified, it defaults to 1.

There are two models when talking about streaming.

  1. Cash Register/Arrivals Only: This model assumes that the frequency associated with each element is additive. For example, consider the following sequence of tuples.

(a,3), (b,2),(a,2), (c,4), (d,1) — This denotes the arrival of 3 copies of a, followed by two copies of b, followed by two copies of a, followed by four copies of c and finally one copy of d, leading to a final state of (a,5),(b,2),(c,4),(d,1). The events in this example could denote network packets or clickstream data.

2. Turnstile/Arrivals and Departure: The turnstile model is more flexible and accounts for additive and decremental frequencies. For example, consider a stream that models commuters entering and leaving a subway station. The decrement should not be regarded as subtraction only. Instead, it can model fluctuating quantities like stock prices over time or compute the difference between two distributions.

Streaming Algorithms

The central goal of streaming algorithms is to process the input stream using a small amount of space, that is, to use s bits of random access working memory. Since m and n are huge, we want to make s much smaller than these. Specifically, we want s to be sublinear in both m and n.

An important thing to note is that the output of streaming algorithms computations is approximate. The inductive reasoning for this is that you cannot exclude any element in the stream to compute exact results. As a result, you end up keeping all the elements that translate to a linear space requirement, proportional to either m or n.

Streaming algorithms are also classified as

  1. Single-Pass: the algorithm can make only one pass over the data set. Single-pass algorithms comprise the following three phases.
  • Initialization: This is where we do initialization before processing the stream.
  • Processing: executed each time we see an element from the stream.
  • Output/Query: This is where we answer questions about the stream.
  1. Multi-Pass: the algorithm can make multiple passes over the data set.

Let's now look at some of the simplest streaming algorithms.

Majority Element — Boyer- Moore Algorithm

The Boyer–Moore majority algorithm is used for finding the majority element in a sequence using linear time and constant space. It was published in 1981 by Robert S. Boyer and J Strother Moore. A majority element is defined as an element that occurs strictly greater than n/2, where n is the size of the sequence. The single-pass Boyer Moore algorithm finds a majority element if one exists. Else the result can be any arbitrary element. A second pass can guarantee that the element returned is the majority element.

The pseudocode below shows the steps.

Pass 1

  1. Initialization: Assign a candidate, initially null and a counter = 0
  2. Processing: For each element x in the sequence
  • if counter is 0, set candidate = x and increment counter
  • else if x == candidate increment counter
  • else decrement counter

3. Output/Query: Return the potential candidate.

Pass 2

  1. Reset counter to 0.
  2. For each element x in the sequence
  • if candidate == x increment counter
  • return candidate if counter > n/2 else return null.

Note that the second pass is required to determine if the candidate is the majority element. The algorithm will report one of the sequence elements with a single pass, even if there is no majority element.

Misra-Gries Algorithm

In the majority algorithm above, we were looking for a majority candidate. A majority candidate was defined as an element that occurs more than n/2 times where n is the size of the sequence. The generalized form of the majority problem is called the Frequency problem. Given a parameter k, output the set of all elements with a frequency greater than n/k, where n is the size of the sequence.

The Misra-Gries Algorithm maintains an associative array, A, whose keys are elements seen in the stream and whose values are counters associated with the elements. It keeps at most k −1counters at any time. First, let us look at the pseudocode for the algorithm.

Pass 1

  1. Initialize: A (empty associative array)

2. Process element x in stream:

  • if x ∈ keys(A) then increment the counter for x in A
  • else if total_keys in A < k − 1 then set A[x] = 1
  • else
  • foreach l in keys(A) do
  • decrement the counter for l in A
  • if A[l] = 0 then remove l from A

Pass 2

  1. Output: On query a, if a ∈ keys(A), make a second pass on the sequence to get frequency, else return null

The counter A[x] is incremented only when we process an occurrence of x in the stream. On the other hand, whenever the total keys in the associative array exceed (or equals) k, we decrement k other counters, corresponding to distinct elements in the stream. Thus, each decrement is witnessed by a collection of k distinct elements from the stream. Since the stream consists of m elements, there can be at most m/k such decrements.

If some element x has frequency > m/k, then its corresponding counter A[x] will be positive at the end of the first pass over the stream; that is, x will be in keys(A). Thus, we can make a second pass over the input stream, counting exactly the frequencies for all x ∈ keys(A), and output the desired set of items.

Source code available at https://github.com/maneeshchaturvedi/streaming-algorithms

--

--

Maneesh Chaturvedi

Seasoned Software Professional, Author, Learner for Life