Probabilistic Counting and Adaptive Sampling

Probabilistic Counting and Adaptive Sampling algorithms are probabilistic algorithms used for estimating the number of distinct elements in multiset. These algorithms can be used to optimise relational database queries, where queries contain joins between tables and tables have large amount of entries.

These algorithms are interesting, because they estimate unique element count in O(N) time while using O(m) space, where 'N' is the number of elements in the multiset and 'm' is a parameter that can be adjusted to increase accuracy at the cost of extra storage and computing required.

This page implements both of these algorithms and allows you to run experiments to check how accurately these algorithms estimate unique element count.

In general case, Adaptive Sampling algorithm is ~50% less accurate than Probabilistic Counting algorithm, however Adaptive Sampling algorithm is simpler, and produces more accurate estimates when ratio of (actual unique element count/'m' parameter) is < 1.

Source of the algorithms:

Probabilistic Counting and Adaptive Sampling

Information