Assignment 3: Binary heaps
Deadline: Dec 4, 2023 14:00 CET
In this set of exercises you will practice working with binary heaps. We will use a binary max-heap to implement an application for ranking similar words.
Word Embeddings
To be able to compare how similar words are, we will be making use of word embeddings. Without delving too much into the details, word embeddings are essentially high-dimensional numerical vectors. Each vector represents the meaning of a word. They are obtained from data based on the neighbours of each word (Distributional semantics: a word’s meaning is defined by its distribution). To visualise them, let’s assume three two-dimensional embeddings:
| word | 1st dimension | 2nd dimension |
|---|---|---|
| dog | 0.91250932 | 0.4090559 |
| cat | 0.85749293 | 0.51449576 |
| chair | 0.12403473 | 0.99227788 |
If we plot these vectors, dog and cat point in nearly the same direction, while chair points elsewhere: words with similar meanings have similar embeddings. We can thus compute the (dis)similarity between two words by calculating the distance between the tips of their embeddings (Euclidean distance): the smaller this distance, the more similar the two words are.
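As a quick sanity check (not part of the template; the vectors below are just the example values from the table above), the Euclidean distance can be computed directly:

```python
import math

# The example two-dimensional embeddings from the table above.
dog = (0.91250932, 0.4090559)
cat = (0.85749293, 0.51449576)
chair = (0.12403473, 0.99227788)

def euclidean(a, b):
    """Distance between the tips of two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(dog, cat))    # ~0.12 -> similar words
print(euclidean(dog, chair))  # ~0.98 -> dissimilar words
```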
The file `food.json` contains a JSON dictionary mapping food-related words to their embeddings. You do not have to worry about loading them in: this is taken care of by the provided `__main__` block and `Embedding` class. Just make sure you understand the provided code before starting the task.
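For reference, loading such a file yourself would look roughly like the sketch below (a hypothetical snippet; the template's `__main__` block and `Embedding` class already take care of this and may look different):

```python
import json

# Hypothetical sketch: food.json maps each word to its embedding vector.
with open("food.json") as f:
    word2embedding = json.load(f)  # e.g. {"apple": [0.12, 0.87, ...], ...}

print(len(word2embedding), "embeddings loaded")
```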
Exercises
As usual, a set of test cases is provided.
Exercise 1
Similarly to a2, implement the following methods of the `BinaryHeap` class in the template, which represents a binary heap as a Python list (using it as an array). Refer to the lecture slides about binary heaps.
- `get_parent()`: Return the index of the parent node. Return `None` if the given node is the root.
- `get_max_child()`: Return the index of the child node with the bigger label. Return `None` if the node does not have any children.
- `push()`: Add a new node with the given label to the heap, according to the `key` provided in the constructor.
- `pop()`: Remove and return the maximum value from the heap. Make sure the size of the list also decreases!
The size of the underlying list should be the same as the number of nodes in the binary heap.
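As a reminder, a binary heap stored in a 0-based Python list uses the standard index arithmetic sketched below (illustration only; the list and its labels are made up):

```python
heap = ["z", "t", "q", "c", "f"]    # a max-heap of labels stored as a list

i = 3                               # the node "c"
parent = (i - 1) // 2               # -> 1, i.e. "t" (the root at index 0 has no parent)
left, right = 2 * i + 1, 2 * i + 2  # -> 7 and 8, both >= len(heap), so "c" is a leaf
```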
The `key` parameter is essentially a function. The `BinaryHeap` will use it to determine by which criterion we compare our embeddings. An example with Python's built-in `sorted()`:
```python
>>> person2age = {"Giulio": 21, "Daniele": 20}
>>> list(person2age.items())                   # the dict as a list of key-value pairs
[('Giulio', 21), ('Daniele', 20)]
>>> criterion = lambda x: x[1]                 # a lambda is an in-line function definition
>>> criterion(("Giulio", 21))                  # given x, return x[1]
21
>>> sorted(person2age.items(), key=criterion)  # sort the pairs by criterion
[('Daniele', 20), ('Giulio', 21)]
```
Exercise 2
Given a particular word, embeddings can be used to find words with similar meanings. A brute-force approach would require k * n operations to find the k most similar words among n words; using a heap data structure, we can reduce this to n * log k.
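The trick is to keep a heap of size at most k while scanning all n candidates, so every push and pop costs O(log k). The sketch below illustrates this idea with Python's standard `heapq` module only; your solution must use the `BinaryHeap` from Exercise 1 together with an appropriate `key`.

```python
import heapq

def k_smallest(values, k):
    """Illustration only: keep the k smallest values using a heap of size k."""
    heap = []                                # stores negated values (max-heap behaviour)
    for v in values:                         # n iterations ...
        if len(heap) < k:
            heapq.heappush(heap, -v)         # ... each costing O(log k)
        elif -v > heap[0]:                   # v beats the current worst kept value
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)

print(k_smallest([5.0, 0.2, 3.1, 0.9, 4.4], k=2))  # [0.2, 0.9]
```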
Implement the `find_similar()` function in the template. This function takes a `target_embedding` and a list of candidate `embeddings`, and returns the `k` most similar words to that target. Make sure to specify here the `key` required by the `BinaryHeap`.
Your implementation should work faster than the brute-force approach.
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- commit all your changes
- push them to your GitHub repository