API Snapshots: Java Core, C++ Core, Python, Memory, Pig, Hive,

Original Quantiles Sketch

Quantiles Sketches Accuracy and Size

Please review the Quantiles Definitions.

The accuracy of a quantile sketch is a function of the configured value k, which also affects the overall size of the sketch.

Accuracy for quantiles sketches is specified and measured with respect to the rank only, not the values.

Absolute vs Relative Error

The Quantiles/DoublesSketch and the KLL Sketch have absolute error. For example, a specified accuracy of 1% at the median (rank = 0.50) means that the true value (if you could extract it from the set) should be between getQuantile(0.49) and getQuantile(0.51). This same 1% error applied at a rank of 0.95 means that the true value should be between getQuantile(0.94) and getQuantile(0.96). In other words, the error is a fixed +/- epsilon for the entire range of rank values.

The ReqSketch, however, has relative rank error and the user can choose which end of the rank domain should have high accuracy. Refer to the sketch documentation for more information.

Quantiles Sketches and Data Independence

A sketch is an implementation of a streaming algorithm. By definition, a sketch has only one chance to examine each item of the stream. It is this property that makes a sketch a streaming algorithm and useful for real-time analysis of very large streams that may be impractical to actually store.

We also assume that the sketch knows nothing about the input data stream: its length, the range of the values or how the values are distributed. If the authors of a particular algorithm require the user to know any of the above attributes of the input data stream in order to “tune” the algorithm, the algorithm is not data independent.

The only thing the user needs to know is how to extract the values from the stream so that they can be fed into the sketch. It is reasonable that the user knows the type of values in the stream: e.g., are they alphanumeric strings, numeric strings, or numeric primitives. These properties may determine the type of sketch to use as well as how to extract the appropriate quantities to feed into the sketch.

Accuracy Information for the org.apache.datasketches.quantiles Sketch Package

A k of 256 produces a normalized rank error of less than 1%. For example, the median value returned from getQuantile(0.5) will be between the actual values from the hypothetically sorted array of input values at normalized ranks of 0.49 and 0.51, with a confidence of about 99%.

The approximate error (epsilon) values listed in the second row of the header in the table below can be computed using the function double DoubleSketch.getNormalizedRankError(int k). The values in this table match very closely with empirical measurements up to k = 1024 at the 99% confidence level. Beyond k = 1024, the estimated error is somewhat speculative, but should be reasonable guidance. These error values simultaneously apply to all half-open intervals (-Infinity, Q] and closed intervals [Q1, Q2].

Table Guide for Quantiles DoublesSketch Size in Bytes and Approximate Error:
          K => |        16        32        64       128       256       512     1,024     2,048     4,096     8,192    16,384    32,768
    ~ Error => |   12.145%    6.359%    3.317%    1.725%    0.894%    0.463%    0.239%    0.123%    0.063%    0.033%    0.017%    0.009%
             N | Size in Bytes ->
----------------------------------------------------------------------------------------------------------------------------------------
             0 |         8         8         8         8         8         8         8         8         8         8         8         8
             1 |        64        64        64        64        64        64        64        64        64        64        64        64
             3 |        64        64        64        64        64        64        64        64        64        64        64        64
             7 |        96        96        96        96        96        96        96        96        96        96        96        96
            15 |       160       160       160       160       160       160       160       160       160       160       160       160
            31 |       288       288       288       288       288       288       288       288       288       288       288       288
            63 |       416       544       544       544       544       544       544       544       544       544       544       544
           127 |       544       800     1,056     1,056     1,056     1,056     1,056     1,056     1,056     1,056     1,056     1,056
           255 |       672     1,056     1,568     2,080     2,080     2,080     2,080     2,080     2,080     2,080     2,080     2,080
           511 |       800     1,312     2,080     3,104     4,128     4,128     4,128     4,128     4,128     4,128     4,128     4,128
         1,023 |       928     1,568     2,592     4,128     6,176     8,224     8,224     8,224     8,224     8,224     8,224     8,224
         2,047 |     1,056     1,824     3,104     5,152     8,224    12,320    16,416    16,416    16,416    16,416    16,416    16,416
         4,095 |     1,184     2,080     3,616     6,176    10,272    16,416    24,608    32,800    32,800    32,800    32,800    32,800
         8,191 |     1,312     2,336     4,128     7,200    12,320    20,512    32,800    49,184    65,568    65,568    65,568    65,568
        16,383 |     1,440     2,592     4,640     8,224    14,368    24,608    40,992    65,568    98,336   131,104   131,104   131,104
        32,767 |     1,568     2,848     5,152     9,248    16,416    28,704    49,184    81,952   131,104   196,640   262,176   262,176
        65,535 |     1,696     3,104     5,664    10,272    18,464    32,800    57,376    98,336   163,872   262,176   393,248   524,320
       131,071 |     1,824     3,360     6,176    11,296    20,512    36,896    65,568   114,720   196,640   327,712   524,320   786,464
       262,143 |     1,952     3,616     6,688    12,320    22,560    40,992    73,760   131,104   229,408   393,248   655,392 1,048,608
       524,287 |     2,080     3,872     7,200    13,344    24,608    45,088    81,952   147,488   262,176   458,784   786,464 1,310,752
     1,048,575 |     2,208     4,128     7,712    14,368    26,656    49,184    90,144   163,872   294,944   524,320   917,536 1,572,896
     2,097,151 |     2,336     4,384     8,224    15,392    28,704    53,280    98,336   180,256   327,712   589,856 1,048,608 1,835,040
     4,194,303 |     2,464     4,640     8,736    16,416    30,752    57,376   106,528   196,640   360,480   655,392 1,179,680 2,097,184
     8,388,607 |     2,592     4,896     9,248    17,440    32,800    61,472   114,720   213,024   393,248   720,928 1,310,752 2,359,328
    16,777,215 |     2,720     5,152     9,760    18,464    34,848    65,568   122,912   229,408   426,016   786,464 1,441,824 2,621,472
    33,554,431 |     2,848     5,408    10,272    19,488    36,896    69,664   131,104   245,792   458,784   852,000 1,572,896 2,883,616
    67,108,863 |     2,976     5,664    10,784    20,512    38,944    73,760   139,296   262,176   491,552   917,536 1,703,968 3,145,760
   134,217,727 |     3,104     5,920    11,296    21,536    40,992    77,856   147,488   278,560   524,320   983,072 1,835,040 3,407,904
   268,435,455 |     3,232     6,176    11,808    22,560    43,040    81,952   155,680   294,944   557,088 1,048,608 1,966,112 3,670,048
   536,870,911 |     3,360     6,432    12,320    23,584    45,088    86,048   163,872   311,328   589,856 1,114,144 2,097,184 3,932,192
 1,073,741,823 |     3,488     6,688    12,832    24,608    47,136    90,144   172,064   327,712   622,624 1,179,680 2,228,256 4,194,336
 2,147,483,647 |     3,616     6,944    13,344    25,632    49,184    94,240   180,256   344,096   655,392 1,245,216 2,359,328 4,456,480
 4,294,967,295 |     3,744     7,200    13,856    26,656    51,232    98,336   188,448   360,480   688,160 1,310,752 2,490,400 4,718,624

The following graphs illustrate the ability of the Quantiles DoublesSketch to characterize value distributions.

  • 1024 (n) values (trueUnsortedValues) were generated from Random’s nextGaussian(). These values were then sorted (trueSortedValues) and assigned their true normalized ranks (trueRanks) from 0 to (n-1)/n.

  • A DoublesSketch (sketch) was then created with K = 32 and updated with the trueUnsortedValues.

  • sketch.getCDF(trueSortedValues) produced an ordered array of estimated ranks (estimatedRanks) . The estimatedRanks (red) were then plotted against the trueRanks (black). The upper bound error was plotted in blue and the lower bound error was plotted in green.

  • The error of the estimation from the sketch is called “Epsilon” and is always with respect to the rank axis. It was plotted as a visual bar on the graph to illustrate its size.

  • The size of this sketch, if stored, would be about 1832 bytes.

QuantilesCDF

  • A getQuantiles(trueRanks) produced an ordered array estimatedSortedValues, which correspond to the trueRanks. Plotting the estimatedSortedValues against the trueSortedValues produces the inverse CDF plot as follows:

QuantilesInverseCDF

The absolute rank error vs the trueRanks produced the following graph.

QuantilesCDFAbsRankError

All of the above plots were generated from one trial, and is not a test of the error bounds.

The following plot illustrates the 99th percentile of observed maximum normalized rank error of DoublesSketch with k=128 in 1000 trials at each stream length. The code to reproduce this measurement is available in the DataSketches/characterization repository. Note that these measurements are not directly comparable to the values in the table above as this graph plots the error for only the half-open intervals (-Infinity, Q], which is relevant to simple queries such as getRank(value).

QuantilesRankError