hyperloglog-estimation-error.md

March 13, 2023 · View on GitHub

HyperLogLog estimation error

The state of an HyperLogLog sketch with precision parameter pp requires $0.75 \cdot m = 0.75 \cdot 2^pbyteswherebytes wheremdenotesthenumberofregisters.Theexpectedrelativestandarderrorisapproximatelygivenbydenotes the number of registers. The expected relative standard error is approximately given by\frac{1.039}{\sqrt{m}},,\frac{1.037}{\sqrt{m}},and, and \frac{0.833}{\sqrt{m}}forthedefault,themaximumlikelihood(ML),andthemartingaleestimator,respectively.Thisisagoodapproximationforallfor the default, the maximum-likelihood (ML), and the martingale estimator, respectively. This is a good approximation for allp\geq 6andlargedistinctcounts.However,theerrorissignificantlysmallerfordistinctcountsthatareintheorderofand large distinct counts. However, the error is significantly smaller for distinct counts that are in the order ofmorsmaller.Thebiasisalwaysmuchsmallerthantherootmeansquareerror(rmse)andcanthereforebeneglected.Thefollowingchartsshowtheempiricallydeterminedrelativeerrorasafunctionofthetruedistinctcountforvariousprecisionparametersor smaller. The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected. The following charts show the empirically determined relative error as a function of the true distinct count for various precision parametersp$ based on 100k simulation runs. Distinct counts up to 1M were simulated by generating random values as hash values. For distinct counts above 1M, a different technique is used by randomly generating only hash values at distinct counts that can actually change the state of the data structure.