API Latest Releases: Java Core, C++ Core, Python, Memory, Pig, Hive,

Contents

Reservoir Sampling Sketches

Reservoir sampling provides a way to construct a uniform random sample of size k from an unweighted stream of items, without knowing the final length of the stream in advance. As with all sketches in the library, reservoir sampling sketches can be efficiently unioned.

The Sketches Library provides 2 forms of reservoir sampling sketches:

  • ReservoirItemsSketch<T>

    This sketch provides a random sample of items of type <T> from the item stream. Every item in the stream will have an equal probability of being included in the output reservoir. If processing the entire data set with a single instance of this class, all permutations of the input elements will be equally likely; if unioning multiple reservoirs, the guarantee is only that each element is equally likely to be selected.

    If the user needs to serialize and deserialize the resulting sketch for storage or transport, the user must also extend the ArrayOfItemsSerDe interface. Three examples of extending this interface are included for Long, String, and Number: ArrayOfLongsSerDe, ArrayOfStringsSerDe, and ArrayOfNumbersSerDe.

  • ReservoirLongsSketch

    This provides a custom implementation for items of type long. Performance is nearly identical to that of the items sketch, but there is no need to provide an ArrayOfItemsSerDe.

Space Usage

The reservoir is initialized with a reservoirCapacity indicating the maximum number of items that can be stored in the reservoir. In contrast to some other sketches in this library, the size does not need to be a power of 2.

When serialized, these sketches use 16 bytes of header data in addition to the serialized size of the items in the reservoir.

Updating the sketch with new items

Reservoir sampling does not maintain a hash list of items or associate additional metadata with them; each item presented to the sketch is handled independently. Consequently, the reservoir may have duplicate items if the input stream contains duplicates.

Unioning Reservoirs

As mentioned above, using a single reservoir to process a data stream ensures that the resulting reservoir contains all elements with equal probability, but that all permutations of the input are also equally probable (subject to the limits of the random number generator). That additional guarantee over permutations no longer applies to reservoir unions.

To union two reservoirs, we first compute a weight for the items in each reservoir based on both the reservoir size and the total number of items that have been presented to the reservoir. Because the items are themselves unweighted, the weight reflects the relative weight of each reservoir and affects the probability of taking items from each.

A union object is initiailzed with a maxUnionCapacity. Unlike the reservoir size, which can only grow until saturating at its capacity, the actual number of items stored in a union can both grow and shrink, depending on the sizes and weights of the input sketches. The only guarantee is that the union result will not grow beyond maxUnionCapacity