Contents
DensitySketch
Quick summary: This sketch builds a coreset from the given set of input points as multi-dimensional vectors. Provides density estimate at a given point.
Our implementation was based on the following paper:
- Zohar Karnin, Edo Liberty “Discrepancy, Coresets, and Sketches in Machine Learning”
https://proceedings.mlr.press/v99/karnin19a/karnin19a.pdf
- The paper explores the relationship between class discrepancy and the efficiency of data reduction techniques like coresets and streaming sketches.
- The core contribution is a general framework for creating small, representative subsets of data (coresets) that maintain the statistical properties of the original large dataset.
Key Highlights:
- New Complexity Measure: The authors define “class discrepancy” as a way to characterize the coreset complexity of different function families, similar to how Rademacher complexity is used for generalization.
- Improved Coreset Sizes: They prove the existence of ε-approximation coresets of size $O(\sqrt{d}/\epsilon)$ for several common machine learning problems, including:
- Logistic regression
- Sigmoid activation loss
- Matrix covariance
- Kernel density estimation
- Gaussian Kernel Resolution: The paper resolves a long-standing open problem by matching the lower bound for the coreset complexity of Gaussian kernel density estimation at $O(\sqrt{d}/\epsilon)$.
- Streaming Algorithms: It introduces an exponential improvement to the “merge-and-reduce” trick, leading to better streaming sketches for any problem with low discrepancy.
- Deterministic Algorithm: The authors provide a simple, deterministic algorithm for finding low-discrepancy sequences and coresets for any positive semi-definite kernel.
Practical Implications:
The findings allow for significantly faster optimization in large-scale machine learning. By reducing a massive dataset into a much smaller coreset, researchers can perform complex calculations (like training a logistic regression model) with a fraction of the computational cost while maintaining a high level of accuracy.
Our implementation was inspired by the following code, example, and tests by Edo Liberty:
- Code: https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde.py
- Example https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde_example_usage.ipynb
- Tests https://github.com/edoliberty/streaming-quantiles/blob/f688c8161a25582457b0a09deb4630a81406293b/gde_test.py
(Note: much of this overview was generated by Gemini AI, it may contain errors.)