API Snapshots: Java Core, Memory, Pig, Hive,

Key Features

Sketch Features Matrix

Common Sketch Properties

  • Please refer to the Sketch Criteria for the criteria for sketches to be included in the library.
  • Query results are approximate but within well defined error bounds that are user configurable by trading off sketch size with accuracy.
  • Designed for Large-scale computing environments that must handle Big Data, e.g.:
  • Maven deployable and registered with the [Central Repository](https://search.maven.org/#search ga 1 DataSketches).
  • Extensive documentation with the systems developer in mind.
  • Designed for production environments:
    • Available in multiple languages: Java, C++, Python
    • Binary compatible across systems and languages

Built-In, General Purpose Functions

  • General purpose Memory Package for managing data off the Java Heap. This enables systems designers the ability to manage their own large data heaps with dedicated processor threads that would otherwise put undue pressure on the Java heap and its garbage collection.
  • General purpose implementaion of Austin Appleby’s 128-bit MurmurHash3 algorithm, with a number of useful extensions.

Robust, High Quality Implementations.

  • Unit Tests:
    • Extensive test code leveraging TestNG.
    • High Test Code coverage (> 90%) as measured by [JaCoCo]https://www.eclemma.org/jacoco) and published by Coveralls.
  • Reproducible Characterization Studies
    • All our published speed and accuracy performance results can be reproduced using the code included in the Characterization repository.
  • Comprehensive Javadocs that satisfy JDK8 Javadoc standards.

Opportunities to Extend

  • There is ample opportunity for interested parties to contribute additional algorithms in this exciting area.

Key Algorithms

Count Distinct / Count Unique

Solves Computational Challenges Associated with Unique Identifiers

  • Estimating cardinality of a stream with many duplicates
  • Performing Set Operations (e.g., Union, Intersection, and Difference) on sets of unique identifiers
  • Estimates of the error bounds of the result can be obtained directly from the result sketch

Four families of Count Unique algorithms:

  • The HLL Sketch. The famous HyperLogLog algorithm when stored sketch size is of utmost concern.
  • The CPC Sketch. The Compressed Probabilistic Counting algorithm when maximizing accuracy per stored sketch size is of utmost concern.
  • The Theta Sketch Framework. Theta sketches enable real-time set-expression computations and can operate on or off the java heap.
  • The Tuple Sketch. Tuple sketches are associative sketches that are useful for performing approximate join operations and extracting other kinds of statistical behavior associated with unique identifiers.

Quantiles

  • Quantiles Overview. Get normal or inverse PDFs or CDFs of the distributions of any numeric value from your raw data in a single pass with well defined error bounds on the results.

Frequent Items

Sampling

  • Reservoir Sampling Knuth’s well known Reservoir sampling “Algorithm R”, but extended to enable merging across different sized reservoirs.
  • Weighted Sampling Edith Cohen’s famous sampling algorithm that enables computing subset sums of weighted samples with optimum variance.