API Latest Releases: Java Core, C++ Core, Python, Memory, Pig, Hive,

Contents

Compressed Probability Counting (CPC) Sketches1

The cpc package contains implementations of Kevin J. Lang’s CPC sketch1. The stored CPC sketch can consume about 40% less space than an HLL sketch of comparable accuracy. Nonetheless, the HLL and CPC sketches have been intentially designed to offer different tradeoffs so that, in fact, they complement each other in many ways.

  • The CPC sketch has better accuracy for a given stored size then HLL
  • HLL has faster serialization and deserialization times than CPC.

Similar to the HLL sketch, the primary use-case for the CPC sketch is for counting distinct values as a stream, and then merging multiple sketches together for a total distinct count.

Neither HLL nor CPC sketches provide means for set intersections or set differences. If you anticipate your application might require this capability you are better off using the Theta family of sketches.


  1. Kevin J Lang. Back to the future: an even more nearly optimal cardinality estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839, 2017.  2