A Count Estimation Algorithm
I read an article in Quanta Magazine about a new algorithm for estimating the count of unique elements in a stream. It sounded so simple that I loaded up a Python notebook and started coding it.
Here's my summary:
- You have a whiteboard with a limit on the number of words you can keep on it.
- You'll go through some process of adding words, which I'll describe later.
- When you reach the limit on the number of words on the whiteboard, remove each word based on a random coin flip. You'll end up with about half the words left on the whiteboard.
- At the end, return a unique-count estimate based on a variable that gets updated in the middle step above.
Now you move forward with what the team calls Round 1: keep going through Hamlet, adding new words as you go. If you come to a word that's already on your list, flip a coin. If it's tails, delete the word; if it's heads, the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard, then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1. On each subsequent round, you flip the coin as many times as the current round number, and a repeated word stays on the list only if every flip comes up heads.
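Coding that up is straightforward; here's roughly what my notebook version looked like. Treat it as a sketch of the article's description as I read it: the function name, the default capacity of 100, and the tokenization in the usage comment are mine, not the article's.

```python
import random

def article_estimate(words, capacity=100):
    """Distinct-count estimate, per my reading of the article's
    round-based description (not the paper's pseudo-code)."""
    board = set()
    round_num = 0  # round 0: fill the whiteboard with no coin flips
    for word in words:
        if word in board:
            # Repeated word: one flip per round so far; the word
            # stays only if every flip comes up heads.
            if not all(random.random() < 0.5 for _ in range(round_num)):
                board.discard(word)
        else:
            board.add(word)
        if len(board) >= capacity:
            # Whiteboard is full: toss a coin for each word, keep
            # about half, and move to the next round.
            board = {w for w in board if random.random() < 0.5}
            round_num += 1
    # Every surviving word was kept with probability (1/2)**round_num.
    return len(board) * 2 ** round_num

# e.g. article_estimate(open("hamlet.txt").read().lower().split())
```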
But when I ran my interpretation, the results were wildly different. Not even in the same ballpark. I should have read the original paper first. I am really rusty at reading academic pseudo-code, but after much careful reading and re-reading, I realized that the algorithm description in the article is wrong: it actually describes a specific case of the algorithm rather than the one in the paper. It turns out I wasn't the only one to discover this; several readers commented with corrections.
Here's the algorithm from the paper:
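In Python, as I understand the paper's pseudo-code (the names are mine, and I've reduced the paper's accuracy-dependent buffer size to a plain `threshold` parameter):

```python
import random

def cvm_estimate(stream, threshold):
    """Distinct-elements estimator from the paper, as I read it.
    `threshold` is the memory bound; the paper derives it from the
    desired accuracy and failure probability."""
    p = 1.0          # current sampling probability
    memory = set()   # the bounded buffer, i.e. the whiteboard
    for item in stream:
        # Every occurrence is resampled: remove the item, then
        # re-insert it with probability p, repeat or not.
        memory.discard(item)
        if random.random() < p:
            memory.add(item)
        if len(memory) == threshold:
            # Buffer is full: keep each element with probability
            # 1/2 and halve the sampling probability.
            memory = {x for x in memory if random.random() < 0.5}
            p /= 2
            if len(memory) == threshold:
                # The paper outputs a failure symbol here; the odds
                # are vanishingly small for reasonable thresholds.
                raise RuntimeError("estimation failed")
    # Each distinct element survives with probability p, so scale up.
    return len(memory) / p
```

This is where the article's version diverges: in the paper, every occurrence of a word is wiped and re-admitted with the current probability p, rather than only repeats being subjected to coin flips, and the final count is scaled up by 1/p, which plays the role of the round counter.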
I really enjoy Quanta Magazine articles; they make difficult and abstract subjects more graspable for lay people like me. The paper itself goes on to prove the correctness of the algorithm and explain why it works, although some of it eludes me because I need to brush up on, and in some places learn, probability material I'm not familiar with.