susam 3 hours ago

When I worked at RSA over a decade ago, we developed Bloom filter-based indexing to speed up querying on a proprietary database that was specialised for storing petabytes of network events and packet data. I implemented the core Bloom filter indexer using MurmurHash2, and I was quite proud of the work I did back then. The resulting improvement in query performance looked impressive to our customers. I remember the querying speed went up from roughly 49,000 records per second to roughly 1,490,000 records per second, so nearly a 30-fold increase.

However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.

With k = 4 hash functions, m = 10007 bits per Bloom filter, and a new Bloom filter for every n = 1000 records, we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-kn/m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
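For reference, here is the arithmetic as a small Python sketch (not the original implementation, which was in C++):

    import math

    k = 4        # hash functions
    m = 10007    # bits per Bloom filter
    n = 1000     # records per filter (one filter per data block)

    # Theoretical false-positive rate: (1 - e^(-k*n/m))^k
    fp_rate = (1 - math.exp(-k * n / m)) ** k
    print(f"theoretical FP rate = {fp_rate:.4f}")       # ~0.0118, about 1.18%

    # Overhead per block: 10007 bits is roughly 1.25 kB
    print(f"filter overhead = {m / 8 / 1000:.2f} kB per block")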

The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like CPU branch misprediction.

A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of just about 5 seconds, or, for larger queries, reducing a 30-minute wait to about 1 minute.

  • susam 2 hours ago

    I just noticed the m = 10007 value in my comment above and I thought I should clarify it. The number of bits per Bloom filter does not need to be a prime number if the hash functions have a uniform distribution. MurmurHash2 does have a uniform distribution, so m was not chosen to be prime in order to reduce collisions in the Bloom filter's bit positions. The reason for using a prime value was more mundane.

    This was a fairly large project, roughly 3 million lines of C and C++, with numerous special-value constants defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007, so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
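    In spirit, the definition was just something like this (illustrative Python; the original was a C++ constant):

        # 10007 = 0x2717: a value unlikely to appear by accident, so it is easy
        # to spot in a core dump and trace back to this definition.
        BITS_PER_BLOOM_FILTER = 10007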

    • xandrius 2 minutes ago

      If I ever need a number near 80000, I might follow your tip and go for 80085 instead (0x138D5 in hex). Nice!

    • anonymars 2 hours ago

      Ha, collision avoidance all the way down

    • gkfasdfasdf 2 hours ago

      Interesting trick about the constant value, and thank you for the detailed write up!

  • 6510 an hour ago

    I'm not an expert at all; I learned they existed after making something similar to search a few thousand blog posts.

    Rather than one hash per file, I made a file for each two-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is record 0, and so on). You could of course do 3 or 4 letters too.

    A query is split into two-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done on them.

    A search for "bloom" would only load bl.raw, lo.raw, oo.raw and om.raw.

    The index is really hard to update, but adding new records is easy. New records are first added as false positives until there are enough bits to push a whole byte onto the end of each raw file.

    I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if they were made of unique characters.

    Characters (or other bytes) could be combined, like s=z, k=c, x=y or i=e.

    Calculating which combination is best for the static data set was too hard for me. One could look at the ratio to see whether a letter combination is worth having.

    But it works, and loading a handful of files or doing an AND is amazingly fast.
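    Roughly, the idea looks like this (a toy in-memory Python sketch; the real thing keeps one .raw bitset file per bigram on disk):

        from collections import defaultdict

        def bigrams(text):
            """Split text into overlapping two-letter combinations."""
            t = text.lower()
            return {t[i:i + 2] for i in range(len(t) - 1)}

        class BigramBitsetIndex:
            def __init__(self):
                self.bitsets = defaultdict(int)   # bigram -> integer used as a bitset
                self.count = 0                    # number of records indexed so far

            def add(self, record_id, text):
                # Set this record's bit in the bitset of every bigram it contains.
                for bg in bigrams(text):
                    self.bitsets[bg] |= 1 << record_id
                self.count = max(self.count, record_id + 1)

            def candidates(self, query):
                # AND the bitsets of every bigram in the query; records whose bit
                # survives *might* contain the query (false positives possible).
                result = (1 << self.count) - 1
                for bg in bigrams(query):
                    result &= self.bitsets.get(bg, 0)
                return [i for i in range(self.count) if result >> i & 1]

        idx = BigramBitsetIndex()
        idx.add(0, "bloom filters are neat")
        idx.add(1, "inverted indexes amortize storage")
        print(idx.candidates("bloom"))   # -> [0]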

    • susam 41 minutes ago

      Nice! Thanks for sharing. What you've described here sounds like an n-gram inverted index (with n = 2) represented as a bitset. We could call it a bigram bitset inverted index. Glad to know you designed and implemented all of this from first principles, and that it serves you well!

adamzwasserman 27 minutes ago

The "no sharing between filters" insight clicked for me on a different problem.

I needed to filter items by tags. A Bloom filter per item seemed clever: quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.

Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.

TFA's mistake is using bloom filters -instead- of an inverted index rather than on top of one. The amortization patterns stack; they don't compete.
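Roughly the shape of the two-level layout, as a toy Python sketch (illustrative only, not the real code; chunk sizes and hash choice are made up):

    import hashlib

    class ChunkFilter:
        """Tiny Bloom filter guarding one chunk of the inverted index."""
        def __init__(self, m=1024, k=3):
            self.m, self.k, self.bits = m, k, 0

        def _positions(self, key):
            for i in range(self.k):
                h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
                yield int.from_bytes(h.digest(), "big") % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits |= 1 << p

        def might_contain(self, key):
            return all(self.bits >> p & 1 for p in self._positions(key))

    # Inverted index split into chunks; each chunk maps tag -> item ids,
    # and one small filter per chunk records which tags the chunk holds.
    chunks = [
        ({"python": [1, 4], "rust": [2]}, ChunkFilter()),
        ({"bloom-filter": [3, 5], "search": [6]}, ChunkFilter()),
    ]
    for postings, flt in chunks:
        for tag in postings:
            flt.add(tag)

    def lookup(tag):
        items = []
        for postings, flt in chunks:
            if not flt.might_contain(tag):       # definitely not in this chunk: skip it
                continue
            items.extend(postings.get(tag, []))  # a false positive costs one extra look
        return items

    print(lookup("bloom-filter"))   # -> [3, 5]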

pi_22by7 15 minutes ago

The key insight about bloom filters lacking synergy is excellent. The ~7K-document crossover point makes sense because inverted indexes amortize dictionary storage across all documents, while bloom filters must encode it linearly per document.

hijinks 33 minutes ago

Is there a better way than Bloom filters to handle needle-in-the-haystack searches, where the haystack might be terabytes of data and you only want a few lines?

sanskarix 5 hours ago

The beautiful thing about Bloom filters is they let you say "definitely not here" without checking everything. That asymmetry is weirdly powerful for specific problems.

I've seen them save startups real money in caching layers: checking "did we already process this event" before hitting the database. False positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
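A minimal sketch of that pattern (illustrative Python; the in-memory set stands in for the real database):

    from hashlib import sha256

    class BloomFilter:
        def __init__(self, m=1 << 20, k=5):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, item):
            for i in range(self.k):
                digest = sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item):
            return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

    processed = set()        # stand-in for the database (the source of truth)
    seen = BloomFilter()     # cheap in-memory guard in front of it

    def already_processed(event_id):
        if not seen.might_contain(event_id):
            return False                 # definitely new: no database query needed
        return event_id in processed     # possible false positive: verify in the DB

    def mark_processed(event_id):
        processed.add(event_id)
        seen.add(event_id)

    mark_processed("evt-123")
    print(already_processed("evt-123"))  # True (filter says maybe, DB confirms)
    print(already_processed("evt-999"))  # almost certainly False, DB never touched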

The trick is recognizing when false positives don't hurt you. Most engineers learn about them in theory but never find the practical use case where they're actually the right tool. Same with skip lists and a bunch of other algorithms: brilliant for 2% of problems, overkill for the rest.

  • adamzwasserman 33 minutes ago

    Exactly. A 1.2% false-positive rate means unnecessary reads only 1.2% of the time, versus reading everything without the filter. Even at a 10% FP rate, you still skip 90% of the I/O.

    This asymmetry works great for I/O-bound workloads (skip-indexes) but fails for TFA's approach where every document needs its own filter.

    In practice, you combine both: inverted index for the dictionary (amortizes across documents), then bloom filters per chunk of the index (amortizes across chunks). This two-level approach handles scale much better than TFA's one-filter-per-document design. It's bloom filters as an optimization layer, not a replacement.