
Designed for Large-scale Computing Systems

Easy Integration with Minimal Dependencies

  • Java Core
    • The Java core library (including Memory) has no runtime dependencies outside of the JVM, allowing simple integration into virtually any Java-based system environment.
    • All of the Java components are Maven deployable and registered with The Central Repository.
  • C++ Core
    • The C++ core is written entirely as header files, allowing easy integration into a wide range of operating system environments.
  • Python
    • The C++ core is extended using the Python binding library pybind11, enabling high-performance operation from Python.

Cross Language Binary Compatibility

  • Sketches serialized from C++ or Python can be interpreted by compatible Java sketches and vice versa.
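
Cross-language compatibility of this kind generally rests on a fixed byte layout that every implementation reads and writes the same way. The following is a minimal sketch of that idea only. The layout, the field names, and the `serialize`/`deserialize` helpers are hypothetical and are not the library's actual serialization format.

```python
import struct

# Hypothetical layout (NOT the library's real format), little-endian, no padding:
#   uint8 version | uint8 flags | uint16 reserved | uint32 count | count x uint64 hashes
HEADER = struct.Struct("<BBHI")
FORMAT_VERSION = 1


def serialize(hashes):
    """Encode a collection of 64-bit hash values into the toy layout."""
    ordered = sorted(hashes)
    header = HEADER.pack(FORMAT_VERSION, 0, 0, len(ordered))
    return header + struct.pack(f"<{len(ordered)}Q", *ordered)


def deserialize(data):
    """Decode the toy layout, rejecting versions this reader does not know."""
    version, _flags, _reserved, count = HEADER.unpack_from(data, 0)
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version: {version}")
    return list(struct.unpack_from(f"<{count}Q", data, HEADER.size))


payload = serialize([3, 1, 2])
restored = deserialize(payload)
```

Because the byte order and field widths are stated explicitly rather than left to the platform, a reader written in any language can decode the same bytes.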

Speed

  • These single-pass, “one-touch” algorithms are fast (see example), which enables real-time processing.

  • Sketches can be represented in an updatable or compact form. The compact form is smaller, immutable and faster to merge.

  • Some of the Java sketches have been designed to be instantiated and operated off-heap, which eliminates costly serialization and deserialization.

  • The sketch data structures are “additive” and embarrassingly parallelizable. Sketches can be merged without losing accuracy.
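
To illustrate what "additive" means here, the following toy K-Minimum-Values (KMV) distinct-count sketch shows that merging sketches built on separate partitions gives the same state as a single pass over all the data. This is an assumption-laden illustration: `ToyKmvSketch` and `build` are hypothetical names, and the code is not the library's implementation or API.

```python
import hashlib


class ToyKmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustration only)."""

    def __init__(self, k=256):
        self.k = k
        self.hashes = set()  # at most the k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform value between 0 and 1.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def _trim(self):
        # Keep only the k smallest values.
        if len(self.hashes) > self.k:
            self.hashes = set(sorted(self.hashes)[: self.k])

    def update(self, item):
        self.hashes.add(self._hash(item))
        self._trim()

    def merge(self, other):
        # Each of the k smallest values of a union is among the k smallest
        # values of the part it came from, so merging loses nothing.
        # Both sketches are assumed to use the same k.
        merged = ToyKmvSketch(self.k)
        merged.hashes = self.hashes | other.hashes
        merged._trim()
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # exact while under capacity
        return (self.k - 1) / max(self.hashes)


def build(items, k=256):
    sketch = ToyKmvSketch(k)
    for item in items:
        sketch.update(item)
    return sketch


single = build(range(20000))        # one pass over everything
part_a = build(range(0, 12000))     # two overlapping partitions,
part_b = build(range(8000, 20000))  # which could be built in parallel
merged = part_a.merge(part_b)
print(merged.estimate(), single.estimate())
```

The two printed estimates are identical because the merged sketch retains exactly the same hash values as the single-pass sketch, even though the partitions overlap.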

Systems Integrations

Specific Sketch Features for Large Data

  • Hash Seed Handling. Additional protection for managing hash seeds, which is particularly important when processing sensitive user identifiers. Available with Theta Sketches.

  • Pre-Sampling. Built-in up-front sampling for cases where additional control is required to limit overall memory consumption when dealing with millions of sketches. Available with Theta Sketches.

  • Memory Package. Large query systems often require their own heaps outside the JVM in order to better manage garbage collection latencies. The Java sketches utilize this powerful package.

  • Built-in Upper-Bound and Lower-Bound Estimators. You are never in the dark about how good an estimate the sketch is providing. All the sketches can estimate the upper and lower bounds of the estimate at a given confidence level.

  • User configurable trade-offs of accuracy vs. storage space as well as other performance tuning options.

  • Small Footprint Per Sketch. The operating and storage footprint for both row-oriented and column-oriented storage is minimized with compact binary representations, which are much smaller than the raw input stream and have a well-defined upper bound on size.
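
The relationship between bounds, accuracy, and size can be sketched with some rough arithmetic. The example below assumes a KMV-style distinct-count sketch that retains up to k 64-bit hash values and has a relative standard error of roughly 1/sqrt(k). Both the error model and the helper names `approximate_bounds` and `max_storage_bytes` are simplifying assumptions for illustration, and the library computes its bounds with its own methods.

```python
import math


def approximate_bounds(estimate, k, num_std_devs=2):
    """Rough lower/upper bounds, assuming relative standard error ~ 1/sqrt(k)."""
    rse = 1.0 / math.sqrt(k)
    margin = num_std_devs * rse * estimate
    return max(0.0, estimate - margin), estimate + margin


def max_storage_bytes(k, bytes_per_hash=8):
    """Upper bound on retained data, assuming k hashes of 8 bytes each
    (any header bytes are ignored in this toy model)."""
    return k * bytes_per_hash


# Larger k buys a tighter interval at the cost of a larger size ceiling.
for k in (256, 4096, 65536):
    low, high = approximate_bounds(1_000_000.0, k)
    print(f"k={k:6d}  bounds=({low:,.0f}, {high:,.0f})  max_bytes={max_storage_bytes(k):,}")
```

Under these assumptions, multiplying k by 16 narrows the interval by a factor of 4, while the size ceiling stays fixed no matter how many items flow through the sketch.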