On Bloom Filters

March 15th, 2008

I’ve got to admit that I don’t come from a hard core CS background (and I’m sure it shows in some of my articles on more complicated topics :P ), but I know a little bit about data structures and algorithms. While I’m no expert on the topic of data structures, I was surprised that – up until now – I had never even heard of Bloom filters.

What are Bloom filters?
Let’s say you have a set of millions of records in a database. Running a query over such a database can be quite time consuming. It would be great if we a cheap way to determine if a given record is likely to be within the database without necessarily performing a query. Bloom filters to the rescue!

Rather than executing a complex database query, we can ask our Bloom filter whether or not a certain record exists. It will then respond with either a definitive “no” (i.e. the record is definitely not in the database) or a “maybe”. In the event our Bloom filter says “maybe”, we must (unfortunately) query the database to determine whether or not the record really exists.

While this may sound like double the workload for our poor application, querying the Bloom filter is likely to be very fast. The goal of the Bloom filter here is to reduce the number of times we actually hit the database, a way to determine the probability a given input exists in our data set. Bloom filters are also small in terms of memory usage, requiring nothing more than a simple bitfield.

A quick overview

Bloom filters primarily consist of a bitfield of size m and one or more (call it k) hash functions. The classic Bloom filter provides two main operations:

  1. add, which is used to add new records to the Bloom filter; and
  2. query, which is used to determine whether a record is likely to be found in the data set.

Starting out with a zero-value bitfield, each time we call add our hash functions are each called on the input record in turn to produce a set of k indices (one for each hash function). For each one of these indices, we set the corresponding bit in the bitfield.

To query this filter, we again generate k indices using our hash functions on the input record. Rather than setting the bits in the bitfield, we instead check each bit. If any one of the bits is not set, then there is zero chance our input record exists in the data set (a definitive “no”). Otherwise, if all k bits are set, we have a “maybe”.

Problems with the classic implementation

There’s one glaring problem with the classic Bloom filter: there are no deletions! If records in your source data set can be deleted, your Bloom filter may need to be regenerated (depending on your accuracy needs). There exists a variation on the classic filter which uses an array of counters instead of a bitfield. The add operation becomes a counter increment. This allows for a delete operation which, predictably, becomes a decrement of the counter. This variant trades memory for flexibility.

A reference implementation

I’m uploading a C reference implementation of a Bloom filter with 1 (super naive) hash function and a 32-bit-long bitfield. Hopefully there is somebody out there who finds it informative.

Download my crummy (but hopefully instructive!) reference implementation here.

WARNING I wrote this implementation from a quick read of the Wikipedia page, so if there are any CS geeks out there who can see I’m clearly doing something wrong, please tell me how wrong I am :)

Categories: C, Software Development | 3 Comments

Fixing Rails Nested Forms (or: HashWithIndifferentAccess is evil)

March 12th, 2008

I’ve moved over to Shine’s Ruby on Rails team and, as such, have been exposed to a whole lot of Rails code over the past few weeks. Something I ran into a few weeks back was this bug, relating to a parsing bug in nested forms. I’ve submitted a series of patches, the last of which I hope to see in HEAD sometime over the next few days.

Anyway, the actual bug was caused by weirdo (and non-Hash-like) semantics in HashWithIndifferentAccess that causes certain types (Hash and Array) to be stored as a copy of the value passed in. The following code will give you an idea of the crazy logic that might result from such behavior:

def mktom(); {:name => 'Tom', :age => 23}; end

tom = mktom
people = {:tom => tom}
tom[:name] = 'Thomas'
puts people[:tom][:name] # displays 'Thomas' - cool!
puts "The same object" if tom.object_id == people[:tom].object_id # displays 'The same object'

tom = mktom
people = {:tom => tom}.with_indifferent_access
tom[:name] = 'Thomas'
puts people[:tom][:name] # displays 'Tom' - wtf!
puts "Not the same object!" if tom.object_id != people[:tom].object_id # displays 'Not the same object!'

The lesson learned from tracking down and fixing this bug?

  • Do not treat HashWithIndifferentAccess as though it were any old Hash. It has different semantics to Hash and may store copies of the keys and/or values instead of the original objects.
  • Do not nest HashWithIndifferentAccess instances in Hash instances unless you enjoy headaches. It’s so easy to think that HashWithIndifferentAccess is semantically identical to Hash (especially thanks to Hash#with_indifferent_access), but this assumption was the cause of two nasty, hard-to-find bugs in this instance.
  • When providing an API virtually identical to (and easily mistakable for) core Ruby classes, ensure that semantics are consistent with convention.

I’m not sure why they didn’t just override the reader for #[]. Maybe there’s more to that than I realize. Anyway, I’ve spent enough time thinking about this for tonight, time for bed. Sorry for the blog drought, I’m hoping to finalize a few more installments of the Ocaml series in along with the final part of my Scala/BCEL parser tutorial over the next few months. Stay tuned. :)

UPDATE Since I wrote this post, I’ve written a plugin that replaces Rails’ parameter parsing implementation to make dealing with complex nested forms easier still. Refer to Taking the Pain Out of Complex Forms in Rails.

Categories: Ruby, Software Development | No Comments