The DataSketches Library is organized into the following repository groups:
This repository has the core-java sketching classes, which are leveraged by some of the other repositories.
Apart from the DataSketches/memory repository, Java, and TestNG (for unit tests), this repository has no external dependencies.
This code is versioned and the latest release can be obtained from Downloads.
High-level Repositories Structure
Packages (org.apache.datasketches.*) | Description |
---|---|
common | Common functions and utilities |
cpc | New Unique Counting Sketch with better accuracy per size than HLL |
fdt | Frequent Distinct Tuples Sketch. |
filters | Bloom filter, Quotient filter, etc. |
frequencies | Frequent Item Sketches, for both longs and generics |
hash | The 128-bit MurmurHash3 and adaptors |
hll | Unique counting HLL sketches for both heap and off-heap. |
hllmap | The (HLL) Unique Count Map Sketch |
kll | Quantiles sketch with better accuracy per size than the standard quantiles sketch. Includes PMF, CDF functions, for floats, doubles. On-heap & off-heap. |
partitions | Special tools to enable large-scale partitioning using the quantiles sketches. |
quantiles | Standard Quantiles sketch, plus PMF and CDF functions, for doubles and generics. On-heap & off-heap. |
quantilescommon | Common functions used by all the quantiles sketches. |
req | Relative Error Quantiles (REQ) sketch, plus PMF and CDF functions for floats, on-heap. Extremely high accuracy for very high ranks (e.g., 99.999%ile) or very low ranks (e.g., 0.00001%ile). |
sampling | Weighted and uniform reservoir sampling with generics |
theta | Unique counting Theta Sketches for both on-heap & off-heap |
thetacommon | Common functions used by all the Theta and Tuple sketches |
tuple | Tuple sketches for both primitives and generics |
tuple.adouble | A Tuple sketch with a Summary of a single double |
tuple.arrayofdoubles | Dedicated implementation of a Tuple sketch with an array of doubles Summary. |
tuple.aninteger | A Tuple sketch with a Summary of a single integer |
tuple.strings | A Tuple sketch with a Summary of an array of Strings |
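As a rough illustration of the kind of algorithm the `sampling` package in the table above refers to, here is a minimal sketch of uniform reservoir sampling (the classic "Algorithm R") in plain Python. It is not the library's implementation, and the function name `reservoir_sample` and its parameters are hypothetical, chosen only for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length.

    Illustrative only: this is textbook Algorithm R, not DataSketches code.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
    print(len(sample), sample)
```

The point of the technique is that memory stays bounded by `k` no matter how long the stream is, which is the same trade-off the sketches in this library make.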
This repository contains Hive UDFs and UDAFs for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Hive. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from Downloads.
Packages (org.apache.datasketches.*) | Description |
---|---|
common | Common functions |
hive.cpc | Hive UDF and UDAFs for CPC sketches |
hive.frequencies | Hive UDF and UDAFs for Frequent Items sketches |
hive.hll | Hive UDF and UDAFs for HLL sketches |
hive.kll | Hive UDF and UDAFs for KLL sketches |
hive.quantiles | Hive UDF and UDAFs for Quantiles sketches |
hive.theta | Hive UDF and UDAFs for Theta sketches |
hive.tuple | Hive UDF and UDAFs for Tuple sketches |
This repository contains Pig User Defined Functions (UDF) for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Pig. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from Downloads.
Packages (org.apache.datasketches.*) | Description |
---|---|
pig.cpc | Pig UDFs for CPC sketches |
pig.frequencies | Pig UDFs for Frequent Items sketches |
pig.hash | Pig UDFs for MurmurHash3 |
pig.hll | Pig UDFs for HLL sketches |
pig.kll | Pig UDFs for KLL sketches |
pig.quantiles | Pig UDFs for Quantiles sketches |
pig.sampling | Pig UDFs for Sampling sketches |
pig.theta | Pig UDFs for Theta sketches |
pig.tuple | Pig UDFs for Tuple sketches |
This repository contains the evolving C++ implementations of the same sketches that are available in Java. These implementations are binary compatible with their counterparts in Java. In other words, a sketch created and serialized in C++ can be opened and read in Java and vice versa. This code is versioned and the latest release can be obtained from Downloads.
Directory | Description |
---|---|
common | Common functions |
count | Count-Min Sketch |
cpc | CPC Sketch |
density | Density Sketch |
fi | Frequent Items Sketch |
hll | HLL Sketch |
kll | KLL Sketch |
quantiles | Classic Quantiles Sketch |
req | REQ Sketch |
sampling | Sampling sketches |
tdigest | t-Digest Sketch |
theta | Theta sketches |
tuple | Tuple sketches |
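The binary compatibility between the C++ and Java implementations described above rests on both sides agreeing on a fixed byte layout and byte order. The following sketch shows that general idea only. The header layout, field names, and values here are entirely hypothetical and are not the actual DataSketches serialization format.

```python
import struct

# Hypothetical little-endian header, NOT the real DataSketches layout:
#   1 byte  serial version
#   1 byte  family id
#   2 bytes k (unsigned)
#   8 bytes n (unsigned)
HEADER = struct.Struct("<BBHQ")


def serialize(k, n, items):
    """Pack a header followed by the retained items as 32-bit floats."""
    header = HEADER.pack(1, 99, k, n)
    body = struct.pack(f"<{len(items)}f", *items)
    return header + body


def deserialize(data):
    """Read back exactly what serialize() wrote, using the same fixed layout."""
    ser_ver, family, k, n = HEADER.unpack_from(data, 0)
    count = (len(data) - HEADER.size) // 4
    items = list(struct.unpack_from(f"<{count}f", data, HEADER.size))
    return ser_ver, family, k, n, items


if __name__ == "__main__":
    blob = serialize(k=200, n=3, items=[1.5, 2.25, -3.0])
    print(len(blob), deserialize(blob))
```

Because the layout is explicit (fixed field widths, explicit byte order, no padding), any language that follows the same description can read the bytes, which is the property that lets a sketch written by one implementation be opened by another.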
This site provides the PostgreSQL-specific adaptors that wrap the C++ implementations, making them available to PostgreSQL database users. PostgreSQL users should download the PostgreSQL extension from pgxn.org. For examples, refer to the README on the component site. This code is versioned and the latest release can be obtained from Downloads.
Files (src/*) | Description |
---|---|
aod_sketch_c_adapter.h | Tuple Array-Of-Doubles Sketch |
cpc_sketch_c_adapter.h | CPC Sketch |
frequent_strings_sketch_c_adapter.h | Frequent Strings Sketch |
hll_sketch_c_adapter.h | HLL Sketch |
kll_double_sketch_c_adapter.h | KLL Doubles Sketch |
kll_float_sketch_c_adapter.h | KLL Floats Sketch |
quantiles_double_sketch_c_adapter.h | Classic Doubles Quantiles Sketch |
req_float_sketch_c_adapter.h | REQ Floats Sketch |
theta_sketch_c_adapter.h | Theta Sketch |
This site has our Python adaptors that wrap the C++ implementations, making the high-performance C++ code available from Python.
Files (src/*) | Description |
---|---|
count_wrapper.cpp | Count-Min Sketch |
cpc_wrapper.cpp | CPC Sketch |
density_wrapper.cpp | Density Sketch |
ebpps_wrapper.cpp | EB-PPS Sampling Sketch |
fi_wrapper.cpp | Frequent Items Sketch |
hll_wrapper.cpp | HLL Sketch |
kll_wrapper.cpp | KLL Sketch |
quantiles_wrapper.cpp | Classic Quantiles Sketch |
req_wrapper.cpp | REQ Sketch |
theta_wrapper.cpp | Theta sketches |
tuple_wrapper.cpp | Tuple sketches |
vector_of_kll.cpp | KLL Vector |
vo_wrapper.cpp | VarOpt Sketch |
This is a new experimental repository for a Docker/container server that enables easy access to the core sketches in the library via HTTP. This component is not formally released, and the code must be obtained from the GitHub site.
This experimental component implements the Frequent Directions algorithm [GLP16]. It is still experimental in that the theoretical work has not yet supplied a suitable measure of error for production work. It can be used as is, but it will not go through a formal Apache release until we can find a way to provide better error properties. It depends on the Memory component. This component is not formally released, and the code must be obtained from the GitHub site.