DataSketches |

Density Sketch
- Foundation Paper
  - Key Highlights
  - Practical Implications
- Inspirational Code, Example and Tests

DensitySketch

Quick summary: This sketch builds a coreset from the given set of input points as multi-dimensional vectors. Provides density estimate at a given point.

Our implementation was based on the following paper:

Zohar Karnin, Edo Liberty “Discrepancy, Coresets, and Sketches in Machine Learning” https://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf
- The paper explores the relationship between class discrepancy and the efficiency of data reduction techniques like coresets and streaming sketches.
- The core contribution is a general framework for creating small, representative subsets of data (coresets) that maintain the statistical properties of the original large dataset.

Key Highlights:

New Complexity Measure: The authors define “class discrepancy” as a way to characterize the coreset complexity of different function families, similar to how Rademacher complexity is used for generalization.
Improved Coreset Sizes: They prove the existence of ε-approximation coresets of size $O(\sqrt{d}/\epsilon)$ for several common machine learning problems, including:
- Logistic regression
- Sigmoid activation loss
- Matrix covariance
- Kernel density estimation
Gaussian Kernel Resolution: The paper resolves a long-standing open problem by matching the lower bound for the coreset complexity of Gaussian kernel density estimation at $O(\sqrt{d}/\epsilon)$.
Streaming Algorithms: It introduces an exponential improvement to the “merge-and-reduce” trick, leading to better streaming sketches for any problem with low discrepancy.
Deterministic Algorithm: The authors provide a simple, deterministic algorithm for finding low-discrepancy sequences and coresets for any positive semi-definite kernel.

Practical Implications:

The findings allow for significantly faster optimization in large-scale machine learning. By reducing a massive dataset into a much smaller coreset, researchers can perform complex calculations (like training a logistic regression model) with a fraction of the computational cost while maintaining a high level of accuracy.

Our implementation was inspired by the following code, example, and tests by Edo Liberty:

Code: https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde.py
Example https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde_example_usage.ipynb
Tests https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde_test.py

(Note: much of this overview was generated by Gemini AI, it may contain errors.)

Contents

DensitySketch

Our implementation was based on the following paper:

Key Highlights:

Practical Implications:

Our implementation was inspired by the following code, example, and tests by Edo Liberty: