If you ever need to build a BloomFilter over a very large data set, this might interest you. The problem arose from exactly such a need in my work: I have a set of 3 million keywords or more, and I needed a way to build a BloomFilter for them both correctly and quickly.
Naturally, iterating over the keyword set and putting each keyword into the BloomFilter one by one is too slow to bear; it would probably take a whole day to wait through it. So a bit of parallelism seems like a good option: for example, create batch jobs, each handling 1k keywords, and have them put their keywords into the BloomFilter. But wait, Guava's implementation isn't thread-safe! And if you search around, there is no concurrent version of it. Alright, since my keywords come from MongoDB, it still makes sense to run batch jobs in parallel to read the data, then synchronize on the BloomFilter to put in each batch of keywords. But that turned out to be too slow as well.
Now comes the hack. The idea is simple: I'll create a pool of BloomFilters, each safely assigned to its own thread, and once the parallel processing is done, I'll merge the BloomFilters into one.
This works because, as long as every BloomFilter uses the same hashing, they all set the same bits for the same input. In Guava's case the bits live in longs, so the merge is just a |= over each word; reflection is used because there is no open API to get at the BitArray or its internal longs. And it performs much faster: the 30-minute test now runs in 120 seconds. Not ideal yet, but acceptable for now. And the code looks like this:
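Since the original snippet depends on Guava internals, here is a self-contained sketch of the same idea using a minimal hand-rolled bloom filter instead of Guava's: the `SimpleBloomFilter` class, its double-hash scheme, and all names below are illustrative assumptions, not Guava's actual implementation. The point it demonstrates is the same: filters built with identical parameters set the same bits for the same key, so a word-by-word `|=` combines them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BloomMerge {

    // A minimal bloom filter for illustration; the double-hash scheme here is
    // an assumption, not Guava's murmur-based strategy.
    static class SimpleBloomFilter {
        final long[] bits;
        final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new long[(numBits + 63) / 64];
            this.numHashes = numHashes;
        }

        private int bitIndex(String key, int i) {
            int h1 = key.hashCode();
            int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
            return Math.floorMod(h1 + i * h2, bits.length * 64);
        }

        void put(String key) {
            for (int i = 0; i < numHashes; i++) {
                int bit = bitIndex(key, i);
                bits[bit >>> 6] |= 1L << (bit & 63);
            }
        }

        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                int bit = bitIndex(key, i);
                if ((bits[bit >>> 6] & (1L << (bit & 63))) == 0) return false;
            }
            return true;
        }

        // The merge hack: since both filters use the same hashing and size,
        // OR-ing each word of the other filter into this one is equivalent to
        // having put all its keys here directly.
        void mergeFrom(SimpleBloomFilter other) {
            for (int i = 0; i < bits.length; i++) {
                bits[i] |= other.bits[i];
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<SimpleBloomFilter>> futures = new ArrayList<>();

        // Each task gets its own filter, so no synchronization is needed
        // while inserting.
        for (int t = 0; t < threads; t++) {
            final int batch = t;
            futures.add(pool.submit(() -> {
                SimpleBloomFilter f = new SimpleBloomFilter(1 << 20, 5);
                for (int k = 0; k < 1000; k++) {
                    f.put("keyword-" + batch + "-" + k);
                }
                return f;
            }));
        }

        // After the parallel phase, fold every per-thread filter into one.
        SimpleBloomFilter merged = new SimpleBloomFilter(1 << 20, 5);
        for (Future<SimpleBloomFilter> fut : futures) {
            merged.mergeFrom(fut.get());
        }
        pool.shutdown();

        System.out.println(merged.mightContain("keyword-0-0"));   // true
        System.out.println(merged.mightContain("keyword-3-999")); // true
    }
}
```

Worth noting: later Guava releases (23.0 and up, if I recall correctly) added a public `BloomFilter.putAll(BloomFilter)` method that performs exactly this merge, so on a modern Guava the reflection hack should no longer be necessary.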