Assignment 6: Bloom filter
Deadline: Dec. 27, 2023, 14:00 CET
In this assignment, we will work with hash functions for duplicate detection. In particular, we will use a Bloom filter for memory efficient approximate duplicate detection. We recommend watching this introductory video before you start working on the solution.
A Bloom filter is a bit array, initialized with all 0s. When an object is added to the filter, we run k (independent) hash functions to get k hash codes for the input object. These hash codes are the indices at which the Bloom filter's bits will be set to 1. To make sure that the hash codes do not go out of bounds (i.e., when they are bigger than the filter's length), we apply the modulo operation.
For example, assume we have two hash functions h1 and h2, both returning 32-bit hash codes, and we are using a Bloom filter of size 50. For a given object s we compute h1(s) and h2(s), take them modulo 50, and set the bits at these two indices to 1. In practice, we would use a much bigger filter and a few more hash functions.
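As an illustration, here is a minimal sketch of this indexing step in Python. The bodies of h1 and h2 are placeholder hash functions invented for the example (the assignment asks you to implement a cyclic-shift hash instead), and the variable names are not the template's API.

```python
SIZE = 50
bits = [0] * SIZE  # the Bloom filter's bit array, all 0s initially

# Placeholder hash functions for illustration only.
def h1(s):
    return sum(ord(c) for c in s) * 2654435761  # simple multiplicative mix

def h2(s):
    code = 0
    for c in s:
        code = code * 31 + ord(c)  # simple polynomial rolling hash
    return code

s = "example"
for h in (h1, h2):
    index = h(s) % SIZE   # modulo keeps the index inside the bit array
    bits[index] = 1       # set the corresponding bit
```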
When checking whether an object is in the filter, we run the k hash functions and check the bits at the indices given by the hash codes. If they are all set to 1, we state that the object is probably in the filter. Otherwise, the object is definitely not in the filter.
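A minimal sketch of this check (the function name and signature here are illustrative, not the template's contains() method):

```python
def probably_contains(bits, hash_functions, word):
    """Return True iff every bit selected by the word's hash codes is set.

    True means the word is probably in the filter (false positives are
    possible); False means it is definitely not in the filter.
    """
    return all(bits[h(word) % len(bits)] == 1 for h in hash_functions)
```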
A Bloom filter allows checking set membership (e.g., words in a text) without keeping the objects themselves. It is also more space-efficient than keeping the set of hash values of the objects, without causing many more false positives. The size of the filter and the number of hash functions determine the number of collisions. For a filter of size m that will be used for n items, the optimal number of hash functions is k = (m / n) * ln 2.
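For instance, a quick back-of-the-envelope calculation of this optimum in Python (the values of m and n below are made-up example numbers, not the assignment's actual parameters):

```python
import math

m = 10_000   # example filter size in bits
n = 1_000    # example number of items to insert

k_opt = (m / n) * math.log(2)   # optimal number of hash functions
print(round(k_opt))             # ~7 for these example values
```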
In this assignment, we will use a Bloom filter to detect duplicate words. Detecting duplicate words, sentences, or documents is a common problem in computational linguistics. For large corpora (with billions of items), a Bloom filter can provide a memory-efficient deduplication method at the cost of some false positives (discarding some non-duplicate elements).
Starter code
- template: a6.py
- tests: test_a6.py
- data: lexicon.txt
Exercises
Exercise 1
Implement the cyclic_hash function in the template (refer to the slides about cyclic-shift hash). You are free to implement any valid (cyclic) hash function, but you should try to avoid collisions as much as possible.
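As one possible starting point, here is a hedged sketch of a textbook-style cyclic-shift string hash. The shift and start parameters mirror the ones mentioned in Exercise 2.1, but the exact signature in a6.py may differ, and you are free to design your own variant.

```python
def cyclic_hash(word, shift=5, start=0):
    """One possible cyclic-shift hash (a sketch, not a reference solution)."""
    mask = 0xFFFFFFFF          # keep the running code within 32 bits
    code = start & mask        # different start values give different hash functions
    for ch in word:
        # Cyclically rotate the current code left by `shift` bits ...
        code = ((code << shift) & mask) | (code >> (32 - shift))
        # ... then mix in the next character.
        code ^= ord(ch)
    return code
```

Varying start (or shift) per call then yields the k hash-function variations needed in Exercise 2.1.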
Exercise 2
Implement the following methods in the BloomFilter class in the template. They do not require many lines of code (one line might even be enough for some), so do not worry that you might be doing something wrong!
- 2.1 _hash(): Use the cyclic_hash() function and its parameters shift or start (or both) to get k variations of this hash function. Then use these functions to return the k different hash codes for this word.
- 2.2 contains(): Check whether the Bloom filter contains a word or not. False positives might happen.
- 2.3 add(): Add a word to the Bloom filter.
Then:
- 2.4 Experiment with the BloomFilter's size and k parameters to minimize collisions. The false-positive rate should be less than 0.1 for the provided word list with duplicates (a sketch of one way to estimate this rate follows this exercise).
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- add the honor code (I/We pledge that this code represents my/our own work)
- commit all your changes
- push them to your GitHub repository