Streaming quantiles algorithms, or quantiles sketches, enable us to analyze the distributions of massive data very quickly using only a small amout of space. They allow us to extract values given a desired rank, or the reverse. Quantiles sketches enable us to plot the CDF, PMF or histograms of a distribution.
The goal of this short tutorial it to introduce to the reader some of the basic concepts of quantiles, ranks and their functions.
Before we can define quantile, we must first define what we mean by rank.
Given an ordered set of values the term rank can be defined two different ways.
The natural rank is a natural number from the set of one-based, natural numbers, ℕ_{1}, and is derived by enumerating an ordered set of values, starting with the value 1, up to n, the number of values in the set.
The normalized rank is a number between 0 and 1 computed by dividing the natural rank by the total number of values in the set, n. Thus, for finite sets, any normalized rank is in the range (0, 1]. Normalized ranks are often written as a percent. But don’t confuse percent with percentile! This will be explained below.
In our sketch library and documentation, when we refer to rank, we imply normalized rank. However, in this tutorial, we will sometimes use natural ranks to simplify the examples.
Normalized rank is closely associated with the concept of mass. The value associated with the rank 0.5 represents the median value, or the center of mass of the entire set where half of the values are below the median and half are above. The concept of mass is important to understanding the Prabability Mass Function (PMF) offered by the quantile sketches in the library. A rank of 0 means a mass of 0 or an empty set.
A quantile is a value that achieves a particular rank.
Quantile is the general term that describes other terms that are also quantiles. To wit:
Because of the relationship of quantiles and ranks, we can define
The functions q(r) and r(q) would form a 1:1 functional pair if q = q(r(q)) and r = r(q(r)). However, duplicate values are quite common in real data so exact 1:1 functionality is not possible. As a result it is often the case that q != q(r(q)) and r != r(q(r)). Duplicate values also could make the rank function, r(q), ambiguous. If there are multiple adjacent ranks with the same value, which rank should the rank function return?
By definiton, sketching algorithms are approximate, and they achieve their high performance by discarding a vast amount of the data. Suppose you feed n items into a sketch that retains only m items. This means n-m values were discarded. The sketch must track the value n used for computing the rank and quantile functions. When the sketch reconstructs the relationship between ranks and values n-m rank values are missing creating holes in the sequence of ranks.
The quantile sketch algorithms discussed in the literature primarily differ by how they choose which values in the stream should be discarded. After the elimination process, all of the quantiles sketch implementations are left with the challenge of how to reconstruct the actual distribution, approximately and with good accuracy.
Given the presence of duplicates and absence of values from the stream we must redefine the above quantle and rank functions as inequalities. Let’s start with a simple example.
You will find both of these in the literature. Our older quantiles/DoublesSketch and our KLL quantiles sketch use the LT criterion. Our newest REQ sketch allows the user to choose.
When searching for quantiles, we require that search to return a quantile, such that our given rank ~ r(q(r)) as close a possible.
In order to do that we use two complementing criteria.
Given the ordered values {10,20,20,20,30}, we can construct the following table of raw ranks and values. For simplicity we will use natural ranks.
Ranks, r | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Values, q | 10 | 20 | 20 | 20 | 30 |
Table 1: Raw data mapping of ranks to values
After processing the stream the actual representation inside the sketch might look like the following. This compresses out duplicate values and effectively skips over missing values. Note that the top rank will always be n.
Ranks, r | 1 | 4 | 5 |
---|---|---|---|
Values, q | 10 | 20 | 30 |
Table 1B: Raw data mapping compressed
We will use Table 1B for the following.
Given a value, V, find an adjacent pair of values, q1,q2, where q1 < V <= q2. Return the rank of q1.
Table 2 represents this mapping.
Given q | 10 | 20 | 30 |
---|---|---|---|
Find r (LT) | 0 | 1 | 4 |
Table 2: Using the LT criterion for finding ranks.
Obtaining the quantile value given the rank is going the opposite direction, so we use the GT (greater-than) criterion.
Given a rank, R, find an adjacent pair of ranks, r1,r2, where r1 <= R < r2. Return q(r2).
Given r | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Find q (GT) | 10 | 20 | 20 | 20 | 30 | 30 |
Table 3: Using the GT criterion for finding quantiles
Given a value, V, find an adjacent pair of values, q1,q2, where q1 <= V < q2. Return the rank of q1.
Given q | 10 | 20 | 30 |
---|---|---|---|
Find r (LE) | 1 | 4 | 5 |
Table 4: The LE criterion for finding ranks.
Obtaining the quantile value given the rank is going the opposite direction, so we use the GE (greater-than-or-equals) criterion.
Given a rank, R, find an adjacent pair of ranks, r1,r2, where r1 < R <= r2. Return q(r2).
Given r | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Find q (GE) | 10 | 20 | 20 | 20 | 30 |
Table 5: The GE criterion for finding quantiles.