Assignment 3: Binary heaps
Deadline: Dec 4, 2023 14:00 CET
In this set of exercises you will practice working with binary heaps. We will use a binary max-heap to implement an application for ranking similar words.
Word Embeddings
To be able to compare how similar words are, we will be making use of word embeddings. Without delving too much into the details, word embeddings are essentially high-dimensional numerical vectors. Each vector represents the meaning of a word. They are obtained from data based on the neighbours of each word (Distributional semantics: a word’s meaning is defined by its distribution). To visualise them, let’s assume three two-dimensional embeddings:
| word | 1st dimension | 2nd dimension |
|---|---|---|
| dog | 0.91250932 | 0.4090559 |
| cat | 0.85749293 | 0.51449576 |
| chair | 0.12403473 | 0.99227788 |
If we plot these vectors, dog and cat point in nearly the same direction, while chair points elsewhere: words with similar meanings have similar embeddings. We can thus compute the (dis)similarity between two words by calculating the distance between the tips of their embeddings (Euclidean distance): the smaller this distance, the more similar the two words are.
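As a quick sanity check (not part of the template; the vectors below are just the example values from the table above), the Euclidean distance can be computed directly:

```python
import math

# The example two-dimensional embeddings from the table above.
dog = (0.91250932, 0.4090559)
cat = (0.85749293, 0.51449576)
chair = (0.12403473, 0.99227788)

def euclidean(a, b):
    """Distance between the tips of two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(dog, cat))    # ~0.12 -> similar words
print(euclidean(dog, chair))  # ~0.98 -> dissimilar words
```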
The file `food.json` contains a JSON dictionary mapping food-related words to their embeddings. You do not have to worry about loading them in: this is taken care of by the provided `__main__` block and `Embedding` class. Just make sure you understand the provided code before starting the task.
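For reference, loading such a file yourself would look roughly like the sketch below (a hypothetical snippet; the template's `__main__` block and `Embedding` class already take care of this and may look different):

```python
import json

# Hypothetical sketch: food.json maps each word to its embedding vector.
with open("food.json") as f:
    word2embedding = json.load(f)  # e.g. {"apple": [0.12, 0.87, ...], ...}

print(len(word2embedding), "embeddings loaded")
```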
Exercises
As usual, a set of test cases is provided.
Exercise 1
Similarly to a2, implement the following methods of the `BinaryHeap` class in the template, which represents a binary heap as a Python list (using it as an array). Refer to the lecture slides about binary heaps.
- `get_parent()`: Return the index of the parent node. Return `None` if the given node is the root.
- `get_max_child()`: Return the index of the child node with the bigger label. Return `None` if the node does not have any children.
- `push()`: Add a new node with the given label to the heap, according to the `key` provided in the constructor.
- `pop()`: Remove and return the maximum value from the heap. Make sure the size of the list also decreases!
The size of the underlying list should be the same as the number of nodes in the binary heap.
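As a reminder, a binary heap stored in a 0-based Python list uses the standard index arithmetic sketched below (illustration only; the list and its labels are made up):

```python
heap = ["z", "t", "q", "c", "f"]    # a max-heap of labels stored as a list

i = 3                               # the node "c"
parent = (i - 1) // 2               # -> 1, i.e. "t" (the root at index 0 has no parent)
left, right = 2 * i + 1, 2 * i + 2  # -> 7 and 8, both >= len(heap), so "c" is a leaf
```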
The `key` parameter is essentially a function. The `BinaryHeap` will use it to determine by which criterion we compare our embeddings. An example with Python's built-in `sorted()`:
```python
>>> person2age = {"Giulio": 21, "Daniele": 20}
>>> list(person2age.items())                   # the dict as a list of key-value pairs
[('Giulio', 21), ('Daniele', 20)]
>>> criterion = lambda x: x[1]                 # a lambda is an in-line function definition
>>> criterion(("Giulio", 21))                  # given x, return x[1]
21
>>> sorted(person2age.items(), key=criterion)  # sort the pairs by criterion
[('Daniele', 20), ('Giulio', 21)]
```
Exercise 2
Given a particular word, embeddings can be used to find words with similar meanings. A brute-force approach would require k * n operations to find the k most similar words among n words; using a heap data structure, we can reduce this to n * log k.
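The trick is to keep a heap of size at most k while scanning all n candidates, so every push and pop costs O(log k). The sketch below illustrates this idea with Python's standard `heapq` module only; your solution must use the `BinaryHeap` from Exercise 1 together with an appropriate `key`.

```python
import heapq

def k_smallest(values, k):
    """Illustration only: keep the k smallest values using a heap of size k."""
    heap = []                                # stores negated values (max-heap behaviour)
    for v in values:                         # n iterations ...
        if len(heap) < k:
            heapq.heappush(heap, -v)         # ... each costing O(log k)
        elif -v > heap[0]:                   # v beats the current worst kept value
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)

print(k_smallest([5.0, 0.2, 3.1, 0.9, 4.4], k=2))  # [0.2, 0.9]
```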
Implement the `find_similar()` function in the template. This function takes a `target_embedding` and a list of candidate `embeddings`, and returns the `k` most similar words to that target. Make sure to specify here the `key` required by the `BinaryHeap`.
Your implementation should work faster than the brute-force approach.
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- commit all your changes
- push them to your GitHub repository