A Count Estimation Algorithm

I read an article in Quanta Magazine about a new algorithm for estimating the count of unique elements in some stream. It sounded so simple that I loaded up a python notebook and started coding it.

Here's my summary:

  • You have a whiteboard with a limit on the number of words you can keep on it"
  • You'll go through some process of adding the word which I'll describe later.
  • When you reach the limit on the number of words in the whiteboard, for each word remove it by a random coin flip. You'll end up with about half the words left in the whiteboard.
  • At the end, return some unique count estimate based on some variable that got modified in the middle step above.
The problem for me came in the second step. Here's what the article says:
Now you move forward with what the team calls Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that's already on your list, flip a coin again. If it's tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.
Then, on subsequent rounds, you have to flip the coin the number of times as the current round number.

But, when I ran my intepreation, the results were wildly different. Not even in the same ballpark. I should have read the original paper. I am really rusty on my academic pseudo-code but, after much careful reading and re-reading, I realized that the algorithm description in the article is wrong, and actually describes a specific case of the algorithm than the one in the paper. Turns out I wasn't the only one to discover this; several readers commented with corrections.

Here's the algorithm from the paper:

Input Stream A \(= \langle a_1, a_2, ..., a_m \rangle\)
Initialize \(p \gets 1; X \gets \emptyset;\) thresh \(\gets 1 \)
for \(i = 1\) to \(m\) do
\(X \gets X \setminus \{a_i\}\)
With probability \(p\), \(X \gets X \cup \{a_i\}\)
if \(|X| =\) thresh then
Throw away each element of \(X\) with probability \(\frac 12 \)
\(p \gets \frac p2 \)
if \(|X| = \) thresh then Output \(\bot\)
Output \(\frac{|X|}p\)

I really enjoy Quanta Magazine articles; they make difficult and abstract subjects more graspable by lay people like myself. The paper itself goes on to prove the correctness of the algorithm and why it even works, although some of it eludes me because I need to brush up and even learn some probabilty stuff I'm not familiar with.

Play with NASA Perseverance Image Data

I wrote this Jupyter Notebook to query, load, and manipulate images from the NASA Perseverance Raw Image Pipeline. Feel free to copy and use to explore. Here's a colorized image from Sol 2.

Missing Sounds of New York: An Auditory Love Letter to New Yorkers

This collection of audio landscapes that you'd hear around New York was released near the beginning of the 2020 covid-19 pandemic, when the city was on lockdown. Created by the New York City Public Library and the creative agency Mother New York. Go Listen →

Sample: Weddell Seal Vocalization

Taken from the video on twitter. Download the mp4.