Our library is made up of components that are partitioned into GitHub repositories by language and dependencies. The dependencies of the core components are kept to a bare minimum to enable flexible integration into many different environments. Meanwhile, the Hive and Pig components, for example, have major dependencies on those environments.
If you have a specific issue or bug report that impacts only one of these components, please open an issue on the respective component. If you are a developer and wish to submit a PR, please choose the appropriate repository.
|Server (Under Development)||https://github.com/apache/datasketches-server|
|Reserved for future use||https://github.com/apache/datasketches|
If you like what you see, give us a Star on one of these two sites!
Java (Versioned, Apache Released): This is the original and the most comprehensive collection of sketch algorithms. It depends on the Memory component, and the Java adapters depend on this component.
C++/Python (Versioned, Apache Released): This is newer and provides most of the major algorithms available in Java. Our C++ adapters depend on this component. The Pybind adapters for Python are included for all the C++ sketches.
Adapters integrate the core components into the aggregation APIs of specific data processing systems. Some of these adapters are available as part of the library; others are integrated directly into the target data processing application.
The code in these components is no longer maintained and will eventually be removed.
This is a new repository dedicated to sketches designed to be run in a mobile client, such as a cell phone. It is still in development and should be considered experimental.
This repository is an experimental staging area for code that will eventually end up in another repository. This code is not versioned.
Demos, command-line access, characterization testing and other code not related to production deployment.
This code is offered “as is” and primarily as a reference so that users can understand how some of the performance characterization plots were obtained. This code has few unit tests, if any, and was never intended for production use. Nonetheless, some folks have found it useful. If you find it useful, go for it. This code is not versioned.
|Sketches-misc Packages||Package Description|
|org.apache.datasketches||Utility functions used by the sketches-misc packages|
|org.apache.datasketches.cmd||Support for command-line functions (being redesigned)|
|org.apache.datasketches.demo||Simple demo for brute-force vs Theta and HLL sketches (will be superseded by command-line functions)|
|org.apache.datasketches.quantiles||Utility for computing & printing space table for Quantiles Sketches (only in the test branch)|
|org.apache.datasketches.sampling||Benchmarks and Entropy testing for sampling sketches|
This characterization repository parallels the Java characterization repository and has the same objective.
This repository is an experimental staging area for C++ code that will eventually end up in another repository.
These repositories provide a command-line tool for accessing the following sketches:
This tool can be installed from Homebrew.