register datasketches-memory-2.0.0.jar;
register datasketches-java-3.1.0.jar;
register datasketches-pig-1.1.0.jar;
define dataToSketch org.apache.datasketches.pig.quantiles.DataToDoublesSketch();
define unionSketch org.apache.datasketches.pig.quantiles.UnionDoublesSketch();
define getQuantile org.apache.datasketches.pig.quantiles.GetQuantileFromDoublesSketch();
a = load 'data.txt' as (value:double, category);
b = group a by category;
c = foreach b generate flatten(group) as (category), flatten(dataToSketch(a.value)) as sketch;
-- Sketches can be stored at this point in binary format to be used later:
-- store c into 'intermediate/$date' using BinStorage();
-- The next two lines print the results in human-readable form for the purposes of this example
d = foreach c generate category, getQuantile(sketch, 0.5); -- median value from the sketch
dump d;
-- The rest of the script can run as a separate query.
-- For example, the first part can produce a daily intermediate feed and store it,
-- and the second part can load several of these daily feeds and union them:
-- c = load 'intermediate/$date1,intermediate/$date2' using BinStorage() as (category, sketch);
e = group c all;
f = foreach e generate flatten(unionSketch(c.sketch)) as (sketch);
g = foreach f generate getQuantile(sketch, 0.5); -- median value from the sketch
dump g;
The example input data has two fields: value and category. The first part of the query builds one quantiles sketch per category and reports each category's median. The second part unions the per-category sketches and reports the median of all values.
From ‘dump d’:
(a,6.0)
(b,16.0)
From ‘dump g’ (merged across categories):
(11.0)
Input data (data.txt), one value and one category per line:
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 a
9 a
10 a
11 b
12 b
13 b
14 b
15 b
16 b
17 b
18 b
19 b
20 b
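The medians reported above can be cross-checked without Pig. The following is a minimal Python sketch that works on the raw values rather than on sketches, so it is an exact computation and does not use the DataSketches library. With so few values per category the sketches are expected to retain every item, which would explain why the outputs match. The helper name exact_quantile and its index rule (the item at position floor(rank * n) in sorted order) are assumptions inferred from the results shown, not taken from the library source.

```python
import math
from collections import defaultdict

# The 20 rows of the example input: values 1..10 in category "a", 11..20 in "b".
rows = [(float(v), "a" if v <= 10 else "b") for v in range(1, 21)]


def exact_quantile(values, rank):
    """Return the item at index floor(rank * n) of the sorted values.

    Assumption: this index rule is inferred from the example outputs,
    not copied from the DataSketches implementation.
    """
    ordered = sorted(values)
    index = min(int(math.floor(rank * len(ordered))), len(ordered) - 1)
    return ordered[index]


# Counterpart of "group a by category" followed by a per-group median.
by_category = defaultdict(list)
for value, category in rows:
    by_category[category].append(value)

per_category = {c: exact_quantile(vs, 0.5) for c, vs in sorted(by_category.items())}

# Counterpart of the union step: the median over all categories together.
merged = exact_quantile([v for v, _ in rows], 0.5)

print(per_category)  # {'a': 6.0, 'b': 16.0}
print(merged)        # 11.0
```

On larger inputs a quantiles sketch returns an approximation, so agreement with an exact computation like this one should only be expected for small data sets.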