API Latest Releases: Java Core, C++ Core, Python, Memory, Pig, Hive,

Theta Sketch Set Operations Accuracy

The Theta Rules

All set operations (Union, Intersection, Difference) between two sketches A and B must obey the following two rules:

  • θresult = min( θA , θB ).
  • All entries retained in the result sketch must be less than θresult.

These rules can be extended to arbitrary set expressions as:

  • θresult = min{ θi }.
  • All entries retained in the result sketch must be less than θresult.

Union Set Expressions

Source sketches and target with the same Nominal Entries or k

As long as all source sketches and target have been configured with the same Nominial Entries or k, and there has been no other intervening Intersection or AnotB operations, the accuracy of the resulting Union sketch will have the same Relative Standard Error (RSE) as determined by the Nominal Entries or k value. In other words,

  • Union Estimate = k/θresult.
  • RSEunion = 1 / sqrt(k - 1).

This remains true no matter how many sketches are unioned together.

Source sketches and target with different Nominal Entries or k

The Relative Standard Error (RSE) of the resulting Union sketch will determined by the Theta Rules above. Ultimately, the source sketch with the smallest theta will dominate the overall resulting error of the result. Given two sketches with equal cardinalities and different values of k, the sketch with the smaller value of k will have the smallest value of theta and will largely determine the error distribution of the result.

Mixed Set Expressions (Union, Intersection, AnotB)

Conceptually, the Intersection and AnotB functions operate by first performing a Union of all the values from both source sketches and then identifying the appropriate proper subset of that Union set creating a result sketch with the Union theta (which is the minimum theta of the source sketches) and the qualifying subset of values.

This means, of course, that depending on the operations and the data, the result set could have zero, all, or some number in between of the retained values of the Union sketch. Mixed set expressions can produce an error distribution that is larger that of a standard sketch of a given Nominal Entries or k and is mathematically described in Theta Sketch Equations / 2.3 Subsets of Fixed k Sampling.

Source sketches and target with the same Nominal Entries or k

When the source and target sketches have the same value of k, the accuracy of the result sketch has a relatively straightforward mathematical solution and intuition:

RSE = sqrt( est(Union(A,B)) / est(SetOperation(A,B)) ) * sqrt( 1/ (k-1) )
Example

The intuition for this can be explained using this simple example of intersection (set difference would behave similarly):

Suppose we have as inputs two segments of unique identifiers, A and B.

Assumptions:

  • We will use a sketch size of k = 4096 entries with an RSE ~ 1/sqrt(k) = 1.6%.
  • Segment A has 4 million (4M) unique identifiers.
  • Segment B has 4 thousand (4K) unique identifiers and is a proper subset of A.

Steps:

  • Build the Sketches. Sketch SA will be fed segment A and Sketch SB will be fed segment B.
    • Segment B fits exactly in SB, so θB = 4K/4K = 1.0.
    • Sketch SA will end up with θA = 4K/4M = .001.

The hash values retained in SA represent a uniform sampling of all of the unique identifiers in segment A and the resulting value of θA represents the effective sampling rate required to end up with k samples, which is ~ 4K/4M = .001.

Even though in the raw data all the values of segment B are in segment A, the probability that all the 4K samples of SB appear SA is extremely low since only one in one thousand can be randomly chosen.

Applying the Theta Rules:

  • θresult = min(0.001, 1.0) = .001
  • All the entries from SA already qualify.
  • Since all 4K values in SB are uniformly distributed between 0 and 1.0, only .001 of them or approximately 4 of the bottom values will remain.
  • The resulting intersection sketch will have, on average, only 4 values and a theta of .001.

The mean estimate from the intersection sketch will be 4/.001 = 4K. This happens to be correct using this hand-wavy analysis but in general is a random result with a variance. The proof that the estimate will be unbiased is in the attached Theta Sketch Equations.

The RSE of a sketch with only 4 values is ~ 1/sqrt(4) = .5 or 50% error. This is considerably larger than the RSE of either SA or SB, which is ~ 1.6% as stated above.

The insight to be gained from this example is that it was the theta (sampling rate) of the sketch of the larger population that caused the increase in error. And, for this example, increasing the sketch size of SA would improve the resulting error. The general case may be more complex.

More formally, if we define a factor F to be the ratio:

F = (size of Union(A,B) ) / (size of Intersection(A,B).

Then a simple way to compute the resulting RSE of an intersection (or difference) is

RSEintersection = sqrt(F) * (RSE of input Sketches)

And in our example:

RSEintersection = sqrt(4M/4K) * 1/sqrt(4K) = 31.63 * 0.016 = 0.5 = 50%.

Source sketches and target with different Nominal Entries or k

In the general case this scenario is even more complex and difficult to predict mathematically, but a conservative estimate would be to use the above equations substituting k with the smallest of the participating k values.