Apache DataSketches GitHub Component Repositories

Our library is made up of multiple components that are partitioned into GitHub repositories by language and dependencies. The dependencies of the core components are kept to a bare minimum to enable flexible integration into many different environments. The Platform Adaptor components have major dependencies on their respective platform environments.

If you have a specific issue or bug report that impacts only one of these components, please open an issue in that component's repository. If you are a developer and wish to submit a PR, please choose the appropriate repository.

If you like what you see, give us a Star on these sites!

Core Sketch Libraries

The key sketches of the Apache DataSketches libraries are available in three (soon four) programming languages. By design, a sketch that is implemented in more than one language is “binary compatible” across those languages via serialization. For example, a sketch created by the DataSketches C++ library and serialized into its compact form can be read by the DataSketches Java library, and vice versa.

Because of differences inherent in the languages, there will be some differences in the APIs, but we try to make the same basic functionality available across all the languages.

Repository	Distribution	Comments
Java Core	Downloads	This is the original and the most comprehensive collection of sketch algorithms. It has a dependency on the Memory component.
Memory (supports Java Core)	Downloads	Provides high-performance access to off-heap memory
C++ Core	Downloads	C++ was our second core language library and provides most of the major algorithms available in Java as well as a few sketches unique to C++.
Python Core	Downloads, PyPI	Python was our third core language library and contains most of the major sketch families that are in Java and C++. All the Python sketches are backed by the C++ library via Pybind.
Go Core	Under Development	Go is our fourth core language and is still evolving.

Platform Adaptors

Adaptors integrate the core library components into the aggregation APIs of specific data processing platforms. Some of these adaptors are available as an Apache DataSketches distribution; others are integrated directly into the target platform.

Repository	Distribution	Comments
Google BigQuery Adaptor	Under Development	Depends on C++ Core
Apache Hive Adaptor	Downloads	Depends on Java Core, Integrations
Apache Pig Adaptor	Downloads	Depends on Java Core, Integrations
PostgreSQL Adaptor	Downloads, pgxn.org	Depends on C++ Core, Integrations
Apache Druid Adaptor	Apache Druid Release	Depends on Java Core, Integrations

Other

Repository	Distribution	Comments
Characterization	Not Formally Released	Used for long-running studies of accuracy and speed performance over many different parameters.
Website	Not Formally Released	Public website
Vector	Not Formally Released	This component implements the Frequent Directions Algorithm [GLP16]. It is still experimental in that the theoretical work has not yet supplied a suitable measure of error for production work. It can be used as is, but it will not go through a formal Apache Release until we can find a way to provide better error properties. It depends on the Memory component.
Server	Not Formally Released	Under development