Assignment 2: Binary trees
Deadline: Nov 27, 2023 14:00 CET
In this set of exercises you will practice working with binary trees. We will work with examples of binary phrase structure (syntactic analyses of natural language sentences).
Exercises
Implement the following methods in the BinaryTree
class in the template,
which represents a binary tree as a Python list (using it as an array).
Please follow the (standard) procedure for implementing binary trees
with arrays as described during the lectures.
Throughout the implementation, we use indices of the array where a node is stored as the ID of the node.
add_node()
: Add a right or left child to a given node with the given label, and return its index. This method should also extend the array when needed. The size of the array should be limited to the minimum storage needed for the current tree. An index value out of bounds indicates a non-existing node.find_node()
: Return an iterable (a sequence, or better, a generator), that returns/yields IDs (indices) of the nodes with the given label. If there is another match found at the left child (or one of its descendants) it should be returned before the node. Possible matches at the right child or its descendants should be returned after the node.get_height()
: Return the height of a node with the given index.get_depth()
: Return the depth of a node with the given index.__str__()
: Return a string representation of the binary tree similar to Penn Treebank notation (see below for details). Spaces are only needed when separating terminal nodes (words) from their parents. Other spacing (including line breaks) and indentation is optional.
For empty slots in the array
(the locations corresponding to non-existing nodes),
use None
as a placeholder.
The size of the list should be the minimum size required by the tree represented.
As usual a set of test cases are provided.
Penn Treebank format
A popular format for storing parse trees is so-called Penn treebank notation. The nodes represent either a phrase label (like NP below), or a word, or terminal (like John below). The format includes a hierarchical organization of parenthetical expression. The label of an internal node is indicated just after an opening parentheses. The leaf nodes, the words, are not placed inside parenthesis, and indicated next to their immediate parent. The unary rules (internal nodes with a single child) are only allowed if the child is a leaf node (word). Hence, every expression inside the parentheses are either two labels (a leaf node preceded by its parent, no parentheses) or a node label followed by its left and right subtree (both subtrees surrounded by parentheses). An example parse tree and its Penn treebank representation are provided below.
(S (NP John)
(VP (V ate)
(NP (Det a)
(N pizza)
)
)
)
Note that we are only concerned with binary trees of the form above (actual Penn treebank format allows n-ary trees with arbitrary n). We also assume that the terminal nodes (the words) are always the left child of their parent.
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- commit all your changes
- push it to your GitHub repository