BloomFilter is a well known technique that has an interesting fact, you can have false positives but no false negatives.
I came across this tool in development of a web service that serves tons of requests for some user keywords based lookups. The storage uses MongoDB, and the service itself is based on Jersey, all in all, the service runs amazingly fast. Nevertheless, as a developer, I was looking for chances to optimize it. One thing bugs me most is the failure requests, as users’ keyword lookup doesn’t exist in our MongoDB. And in general such MISS scenarios cost almost the same cpu time as in a HIT scenario.
The 1st thought was to use BloomFilter to record the MISS keywords at each client, so that, when the next time it happens, no http roundtrip is required, a check in the BloomFilter would suffice. Or would it? The problem wasn’t really solved given the chance of false positives from BloomFilter. 2nd time it might be a miss, but can we afford to trust it? Another question is how large the BloomFilter is supposed to be? As user requests in theory could be an infinite set of keywords. Well, this doesn’s work. And no surprise, coz I was using BloomFilter wrong!
I shouldn’t have used BloomFilter in the negative cases, but positive cases. And in my case, I shall generate a BloomFilter upon the keywords that I have in MongoDB, and send this BloomFilter to all my clients. Each client should check against the BloomFilter before it calls the service about this suspicious keyword, if the BloomFilter said it’s not there, as no false negative exists, we’re fully confident of saving this roundtrip, also, as the keywords are enumerable, our BloomFilter’s size could be anticipated.
This solution could serve a wider audience in fact, as long as you want to save some costly operations, and you know the positive cases in advance, and you might have a big number of negative cases, using BloomFilter will be well-worth the effort. Btw, we used BloomFilter implementation from Guava, and be noticed, only after 11.0.2 the BloomFilter is correctly implemented, previous ones have performance issue in it.