Optimal binary search tree


In computer science, an optimal binary search tree (BST), sometimes called a weight-balanced binary tree,^[1] is a binary search tree which provides the smallest possible search time (or expected search time) for a given sequence of accesses (or access probabilities). Optimal BSTs are generally divided into two types: static and dynamic.

In the static optimality problem, the tree cannot be modified after it has been constructed. In this case, there exists some particular layout of the nodes of the tree which provides the smallest expected search time for the given access probabilities. Various algorithms exist to construct or approximate the statically optimal tree given the information on the access probabilities of the elements.

In the dynamic optimality problem, the tree can be modified at any time, typically by permitting tree rotations. The tree is considered to have a cursor starting at the root which it can move or use to perform modifications. In this case, there exists some minimal-cost sequence of these operations which causes the cursor to visit every node in the target access sequence in order. The splay tree is conjectured to have a constant competitive ratio compared to the dynamically optimal tree in all cases, though this has not yet been proven.

Static optimality

Definition

In the static optimality problem as defined by Knuth,^[2] we are given a set of $n$ ordered elements and a set of $2n+1$ probabilities. We will denote the elements $a_1$ through $a_n$ and the probabilities $A_1$ through $A_n$ and $B_0$ through $B_n$ . $A_i$ is the probability of a search being done for element $a_i$ . For $1 \le i < n$ , $B_i$ is the probability of a search being done for an element strictly between $a_i$ and $a_{i+1}$ . $B_0$ is the probability of a search being done for an element strictly less than $a_1$ , and $B_n$ is the probability of a search being done for an element strictly greater than $a_n$ . These $2n+1$ probabilities cover all possible searches, and therefore add up to one.

The static optimality problem is the optimization problem of finding the binary search tree that minimizes the expected search time, given the $2n+1$ probabilities. As the number of possible trees on a set of $n$ elements is ${2n \choose n}\frac{1}{n+1}$ ,^[2] which is exponential in $n$ , brute-force search is not usually a feasible solution.
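To give a feel for how quickly this count grows, the following minimal Python sketch evaluates the formula above for a few values of $n$. The function name `count_bsts` is chosen here purely for illustration.

```python
from math import comb


def count_bsts(n: int) -> int:
    """Number of distinct binary search tree shapes on n ordered elements.

    This is the formula from the text, C(2n, n) / (n + 1); the division
    is always exact, so integer division is safe.
    """
    return comb(2 * n, n) // (n + 1)


for n in (1, 2, 3, 5, 10, 20, 40):
    print(n, count_bsts(n))
```

Even at $n = 10$ there are already 16796 candidate trees, which is why enumerating them all is rarely practical.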

Knuth's dynamic programming algorithm

In 1971, Knuth published a relatively straightforward dynamic programming algorithm capable of constructing the statically optimal tree in only O(n²) time.^[2] Knuth's primary insight was that the static optimality problem exhibits optimal substructure; that is, if a certain tree is statically optimal for a given probability distribution, then its left and right subtrees must also be statically optimal for their appropriate subsets of the distribution.

To see this, consider what Knuth calls the "weighted path length" of a tree. The weighted path length of a tree on n elements is the sum of the lengths of all 2n+1 possible search paths, weighted by their respective probabilities. The tree with the minimal weighted path length is, by definition, statically optimal.

But weighted path lengths have an interesting property. Let P be the weighted path length of a binary tree, P_L be the weighted path length of its left subtree, and P_R be the weighted path length of its right subtree. Also let W be the sum of all the probabilities in the tree. Observe that when either subtree is attached to the root, the depth of each of its elements (and thus each of its search paths) is increased by one. Also observe that the root itself has a depth of one. This means that the difference in weighted path length between a tree and its two subtrees is exactly the sum of every single probability in the tree, leading to the following recurrence:

$P = P_L + P_R + W$
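The recurrence can be checked on a small tree with a short Python sketch. The tree encoding (nested tuples `(left, k, right)`, with `None` for an empty subtree) and the function name are assumptions made for this illustration. An empty subtree lying between $a_i$ and $a_{i+1}$ is taken to contribute $B_i$ as both its path length and its weight, matching the boundary condition $P_{ii} = W_{ii} = B_i$ used in this article.

```python
def path_length_and_weight(tree, i, j, A, B):
    """Return (P, W) for a subtree holding elements a_{i+1} .. a_j.

    tree is None (empty subtree) or a tuple (left, k, right), where k is
    the 1-based index of the root element. A[1..n] and B[0..n] are the
    access probabilities; A[0] is an unused placeholder.
    """
    if tree is None:
        # An empty subtree is a single unsuccessful-search gap.
        assert i == j, "an empty subtree must cover exactly one gap"
        return B[i], B[i]
    left, k, right = tree
    p_left, w_left = path_length_and_weight(left, i, k - 1, A, B)
    p_right, w_right = path_length_and_weight(right, k, j, A, B)
    w = w_left + A[k] + w_right
    # P = P_L + P_R + W
    return p_left + p_right + w, w


# Two elements: a_1 at the root, a_2 as its right child.
A = [0.0, 0.3, 0.2]
B = [0.1, 0.2, 0.2]
tree = (None, 1, (None, 2, None))
P, W = path_length_and_weight(tree, 0, 2, A, B)
print(P, W)
```

Counting path lengths directly gives the same value: $0.3 \cdot 1 + 0.2 \cdot 2 + 0.1 \cdot 2 + 0.2 \cdot 3 + 0.2 \cdot 3 = 2.1$.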

This recurrence leads to a natural dynamic programming solution. Let $P_{ij}$ be the weighted path length of the statically optimal search tree for all values strictly between $a_i$ and $a_{j+1}$ (that is, the tree containing the elements $a_{i+1}$ through $a_j$), let $W_{ij}$ be the total weight of that tree, and let $R_{ij}$ be the index of its root. The algorithm can be built using the following formulas:

\begin{align}
P_{ii} = W_{ii} &= B_i && \text{for } 0 \leq i \leq n \\
W_{ij} &= W_{i,j-1} + A_j + B_j && \text{for } 0 \leq i < j \leq n \\
P_{i,R_{ij}-1} + P_{R_{ij},j} &= \min_{i<k\leq j}(P_{i,k-1} + P_{kj}) = P_{ij} - W_{ij} && \text{for } 0 \leq i < j \leq n
\end{align}

The naive implementation of this algorithm actually takes O(n³) time, but Knuth's paper includes some additional observations which can be used to produce a modified algorithm taking only O(n²) time.
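The following Python sketch fills in the tables $P$, $W$ and $R$ from the formulas above. It is an illustration rather than a transcription of Knuth's paper: the function name and the `restrict_roots` flag are invented here, and the speed-up shown is the commonly cited root-monotonicity property $R_{i,j-1} \le R_{ij} \le R_{i+1,j}$, which narrows the range of candidate roots. Integer weights (probabilities scaled by 100) are used so that ties are broken exactly.

```python
def optimal_bst(A, B, restrict_roots=True):
    """Fill the P, W, R tables for the static optimality problem.

    A[1..n] and B[0..n] are access weights; A[0] is an unused placeholder.
    With restrict_roots=False every root k in i+1..j is tried (cubic time).
    With restrict_roots=True only k in R[i][j-1]..R[i+1][j] is tried,
    which is the root-monotonicity speed-up (quadratic time overall).
    """
    n = len(B) - 1
    P = [[0] * (n + 1) for _ in range(n + 1)]
    W = [[0] * (n + 1) for _ in range(n + 1)]
    R = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        P[i][i] = W[i][i] = B[i]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            W[i][j] = W[i][j - 1] + A[j] + B[j]
            if restrict_roots and length > 1:
                k_lo, k_hi = R[i][j - 1], R[i + 1][j]
            else:
                k_lo, k_hi = i + 1, j
            # min() returns the leftmost minimiser, so ties pick the smallest k.
            best = min(range(k_lo, k_hi + 1),
                       key=lambda k: P[i][k - 1] + P[k][j])
            R[i][j] = best
            P[i][j] = P[i][best - 1] + P[best][j] + W[i][j]
    return P, W, R


# Five elements, weights scaled by 100 so that they sum to 100.
A = [0, 15, 10, 5, 10, 20]
B = [5, 10, 5, 5, 5, 10]
P, W, R = optimal_bst(A, B)
print(P[0][5], R[0][5])  # prints "275 2"
```

Dividing by 100, the optimal weighted path length for this input is 2.75, with $a_2$ at the root. With floating-point probabilities the same code applies, although near-ties may then be broken by rounding error.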

Mehlhorn's approximation algorithm

While the O(n²) time taken by Knuth's algorithm is substantially better than the exponential time required for a brute-force search, it is still too slow to be practical when the number of elements in the tree is very large.

In 1975, Kurt Mehlhorn published a paper proving that a much simpler algorithm could be used to closely approximate the statically optimal tree in only O(n) time.^[3] In this algorithm, the root of the tree is chosen so as to most closely balance the total weight (by probability) of the left and right subtrees. This strategy is then applied recursively on each subtree.

That this strategy produces a good approximation can be seen intuitively by noting that the weights of the subtrees along any path form something very close to a geometrically decreasing sequence. In fact, this strategy generates a tree whose weighted path length is at most

$2+(1 - \log(\sqrt{5} - 1))^{-1}H$

where H is the entropy of the probability distribution. Since no optimal binary search tree can ever do better than a weighted path length of

$(1/\log3)H$

this approximation is very close.^[3]
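The root-balancing strategy described above can be sketched in Python as follows. This is a simplified illustration: it scans every candidate root at each level, so it does not achieve the O(n) running time of Mehlhorn's algorithm, and the function names and nested-tuple tree encoding are assumptions made here. The helper `cost` evaluates the weighted path length using $P = P_L + P_R + W$ with an empty subtree contributing $B_i$.

```python
from itertools import accumulate


def balanced_weight_tree(A, B):
    """Build a tree by repeatedly picking the root that best balances
    the total weight of the left and right subtrees.

    A[1..n] and B[0..n] are access weights; A[0] is an unused placeholder.
    Returns None for an empty tree or a tuple (left, k, right).
    """
    n = len(B) - 1
    # prefix[j] = B[0] + (A[1] + B[1]) + ... + (A[j] + B[j])
    prefix = list(accumulate(B[j] + (A[j] if j else 0) for j in range(n + 1)))

    def weight(i, j):
        # Weight of the subtree on a_{i+1}..a_j: B_i + A_{i+1} + ... + B_j
        return prefix[j] - prefix[i] + B[i]

    def build(i, j):
        if i == j:
            return None
        k = min(range(i + 1, j + 1),
                key=lambda k: abs(weight(i, k - 1) - weight(k, j)))
        return (build(i, k - 1), k, build(k, j))

    return build(0, n)


def cost(tree, i, j, A, B):
    """Return (P, W) for a subtree on a_{i+1}..a_j."""
    if tree is None:
        return B[i], B[i]
    left, k, right = tree
    p_l, w_l = cost(left, i, k - 1, A, B)
    p_r, w_r = cost(right, k, j, A, B)
    w = w_l + A[k] + w_r
    return p_l + p_r + w, w


A = [0, 15, 10, 5, 10, 20]   # weights scaled by 100
B = [5, 10, 5, 5, 5, 10]
tree = balanced_weight_tree(A, B)
print(tree)
print(cost(tree, 0, 5, A, B))  # prints "(280, 100)"
```

For this input the strategy places $a_3$ at the root and yields a weighted path length of 2.80 after dividing by 100.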

Dynamic optimality

Definition

There are several different definitions of dynamic optimality, all of which are effectively equivalent to within a constant factor in terms of running-time.^[4] The problem was first introduced implicitly by Sleator and Tarjan in their paper on splay trees,^[5] but Demaine et al. give a very good formal statement of it.^[4]

In the dynamic optimality problem, we are given a sequence of accesses x₁, ..., x_m on the keys 1, ..., n. For each access, we are given a pointer to the root of our BST and can use the pointer to perform any of the following operations:

Move the pointer to the left child of the current node.
Move the pointer to the right child of the current node.
Move the pointer to the parent of the current node.
Perform a single rotation on the current node and its parent.

Our BST algorithm can perform any sequence of the above operations as long as the pointer eventually ends up on the node containing the target value x_i. The time it takes a given dynamic BST algorithm to perform a sequence of accesses is equivalent to the total number of such operations performed during that sequence. Given any sequence of accesses on any set of elements, there is some BST algorithm which performs all accesses using the fewest total operations.
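The cost model above can be made concrete with a small Python simulation in which each pointer move or rotation adds one to a counter. The class and method names are invented for this sketch. The `access` strategy shown (walk down to the key, then rotate it up to the root) is only a simple example of a valid BST algorithm in this model; it is not claimed to be optimal. The sketch assumes every accessed key is present in the tree.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None


class CursorBST:
    """BST with a single cursor; each move or rotation costs one unit."""

    def __init__(self, keys):
        self.root = None
        self.cost = 0
        for key in keys:              # initial construction is not charged
            self._insert(key)
        self.cursor = self.root

    def _insert(self, key):
        node, parent = self.root, None
        while node is not None:
            parent = node
            node = node.left if key < node.key else node.right
        new = Node(key)
        new.parent = parent
        if parent is None:
            self.root = new
        elif key < parent.key:
            parent.left = new
        else:
            parent.right = new

    def move_left(self):
        self.cursor = self.cursor.left
        self.cost += 1

    def move_right(self):
        self.cursor = self.cursor.right
        self.cost += 1

    def move_parent(self):
        self.cursor = self.cursor.parent
        self.cost += 1

    def rotate(self):
        """Rotate the cursor node above its parent."""
        x = self.cursor
        p = x.parent
        g = p.parent
        if p.left is x:               # right rotation
            p.left = x.right
            if x.right is not None:
                x.right.parent = p
            x.right = p
        else:                         # left rotation
            p.right = x.left
            if x.left is not None:
                x.left.parent = p
            x.left = p
        p.parent = x
        x.parent = g
        if g is None:
            self.root = x
        elif g.left is p:
            g.left = x
        else:
            g.right = x
        self.cost += 1

    def access(self, key):
        """Example strategy: find key, then rotate it to the root."""
        self.cursor = self.root       # each access starts at the root
        while self.cursor.key != key:
            if key < self.cursor.key:
                self.move_left()
            else:
                self.move_right()
        while self.cursor.parent is not None:
            self.rotate()

    def inorder(self):
        out = []

        def walk(node):
            if node is not None:
                walk(node.left)
                out.append(node.key)
                walk(node.right)

        walk(self.root)
        return out


tree = CursorBST([2, 1, 3])
for x in [3, 3, 1]:
    tree.access(x)
print(tree.cost, tree.root.key, tree.inorder())  # prints "6 1 [1, 2, 3]"
```

The repeated access to 3 costs nothing extra because the key is already at the root, which illustrates how restructuring can pay off when the access sequence has locality.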

This model defines the fastest possible tree for a given sequence of accesses, but computing the optimal sequence of operations in this sense requires knowing the entire access sequence in advance. If we let OPT(X) be the number of operations performed by the strictly optimal tree for an access sequence X, we can say that a tree is dynamically optimal if, for every X, it performs X in time O(OPT(X)) (that is, it has a constant competitive ratio).^[4]

There are several data structures conjectured to have this property, but none proven. It is an open problem whether there exists a dynamically optimal data structure in this model.

Splay trees

The splay tree is a data structure invented in 1985 by Daniel Sleator and Robert Tarjan which is conjectured to be dynamically optimal in the required sense. That is, a splay tree is believed to perform any sufficiently long access sequence X in time O(OPT(X)).^[5]

Tango trees

The tango tree is a data structure proposed in 2004 by Demaine et al. which has been proven to perform any sufficiently long access sequence X in time $O(\log\log n \operatorname{OPT}(X))$ . While this is not dynamically optimal, the competitive ratio of $\log\log n$ is still very small for reasonable values of n.^[4]

Other results

In 2013, John Iacono published a paper which uses the geometry of binary search trees to provide an algorithm which is dynamically optimal if any binary search tree algorithm is dynamically optimal.^[6]

Notes

[1]
[2] Knuth (1971).
[3] Mehlhorn (1975).
[4] Demaine et al. (2004).
[5] Sleator and Tarjan, paper on splay trees.
[6] Iacono (2013).

