http://www.codeproject.com/Articles/9680/Persistent-Data-Structures
There is another meaning for the word persistence when it is used to describe data structures, particularly those used in functional programming languages. In that context, a persistent data structure is a data structure capable of preserving the current version of itself when modified. In essence, a persistent data structure is immutable.
Persistent Singly Linked Lists
To insert a new node at the fourth position, we traverse the list as before, copying each node along the way. Each copied node is linked to the next copied node:
The last copied node is linked to the new node, and the new node is linked to the remaining nodes in the old list:
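The copy-on-insert walk described above can be sketched in a few lines. This is illustrative Python rather than the article's C#, and the `Node` and `insert` names are my own; the point is that only the nodes before the insertion point are copied, while everything after it is shared:

```python
class Node:
    """An immutable singly linked list node."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def insert(head, index, value):
    """Return a new list with `value` inserted at `index`.

    Nodes before the insertion point are copied; the new node links
    directly to the remaining nodes of the old list, which are shared.
    """
    if index == 0:
        return Node(value, head)  # new node points at the old tail
    if head is None:
        raise IndexError("index out of range")
    # Copy this node, linking it to the (recursively built) new suffix.
    return Node(head.value, insert(head.next, index - 1, value))

def to_list(node):
    """Collect the values for inspection."""
    out = []
    while node is not None:
        out.append(node.value)
        node = node.next
    return out

# Build 1 -> 2 -> 3 -> 5, then insert 4 at position 3 (the fourth slot).
old = Node(1, Node(2, Node(3, Node(5))))
new = insert(old, 3, 4)
```

Note that the old version is untouched, and the node holding 5 is physically the same object in both lists.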
Persistent Binary Trees
This illustrates the "spine" of the search down the tree. The red nodes are the only nodes that need to be copied in making a new version of the tree:
You can see that the majority of the nodes do not need to be copied. Assuming the binary tree is balanced, the number of nodes that need to be copied any time a write operation is performed is at most O(Log N), where Log is base 2.
Insertions and deletions work the same way, except that steps should be taken to keep the tree in balance, such as using an AVL tree. If a binary tree becomes degenerate, we run into the same efficiency problems as we did with the singly linked list.
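The spine-copying idea can be shown with a short sketch. This is a Python illustration of path copying on a plain binary search tree; it deliberately omits the AVL rebalancing the article recommends, so the names and structure are mine, not the article's code:

```python
class TreeNode:
    """An immutable binary search tree node."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def insert(root, key):
    """Return a new tree containing `key`.

    Only the nodes on the search path (the "spine") are copied; every
    subtree hanging off that path is shared with the old version.
    """
    if root is None:
        return TreeNode(key)
    if key < root.key:
        return TreeNode(root.key, insert(root.left, key), root.right)
    elif key > root.key:
        return TreeNode(root.key, root.left, insert(root.right, key))
    return root  # key already present; reuse the whole subtree

# A balanced tree of 1..8 (without 6); inserting 6 copies only the
# path 4 -> 7 -> 5 and shares the entire left subtree and the leaf 8.
old = TreeNode(4,
               TreeNode(2, TreeNode(1), TreeNode(3)),
               TreeNode(7, TreeNode(5), TreeNode(8)))
new = insert(old, 6)
```

In a balanced tree of N nodes the spine has O(log N) nodes, which is why writes stay logarithmic.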
Random Access Lists
An interesting persistent data structure that combines the singly linked list with the binary tree is Chris Okasaki's random-access list. This data structure allows for random access of its items as well as adding and removing items from the beginning of the list. It is structured as a singly linked list of completely balanced binary trees. The advantage of this data structure is that it allows access, insertion, and removal of the head of the list in O(1) time as well as provides logarithmic performance in randomly accessing its items.
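A compact way to see the structure is a sketch of the skew-binary variant of Okasaki's random-access list. This is my own minimal Python rendering, not the article's implementation: the list is a Python list of `(size, tree)` pairs, where each tree is a completely balanced binary tree stored as a `(value, left, right)` tuple and sizes are of the form 2^k - 1:

```python
def cons(x, ts):
    """Prepend x in O(1): if the first two trees have equal size,
    merge them under a new root holding x; otherwise add a singleton."""
    if len(ts) >= 2 and ts[0][0] == ts[1][0]:
        merged_size = 1 + ts[0][0] + ts[1][0]
        return [(merged_size, (x, ts[0][1], ts[1][1]))] + ts[2:]
    return [(1, (x, None, None))] + ts

def head(ts):
    """The head is always the root of the first tree: O(1)."""
    return ts[0][1][0]

def tail(ts):
    """Drop the head in O(1) by splitting the first tree in two."""
    size, (_, left, right) = ts[0]
    if size == 1:
        return ts[1:]
    return [(size // 2, left), (size // 2, right)] + ts[1:]

def lookup(ts, i):
    """Random access in O(log n): skip whole trees, then descend one."""
    for size, tree in ts:
        if i < size:
            while True:
                value, left, right = tree
                if i == 0:
                    return value
                i -= 1
                size //= 2
                if i < size:
                    tree = left
                else:
                    tree, i = right, i - size
        i -= size
    raise IndexError(i)

# Build the list 0..9 by consing in reverse order.
ral = []
for v in range(9, -1, -1):
    ral = cons(v, ral)
```

Every operation returns a new top-level list while sharing all untouched trees, so old versions remain valid.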
Immutable Collections
Stack
This one was easy. Simply create a persistent singly linked list and limit insertions and deletions to the head of the list. Since this class is persistent, popping a stack returns a new version of the stack with the next item in the old stack as the new top. In the System.Collections.Stack version, popping the stack returns the top of the stack. The question for the persistent version was how to make the top of the stack available, since it cannot be returned when the stack is popped. I chose to create a Top property that represents the top of the stack.
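The shape of such a stack can be sketched briefly. This is an illustrative Python analogue of the design described above (the class and member names are mine; the article's class is C#): `pop` returns the next version of the stack, and a `top` property exposes the head since `pop` cannot return it:

```python
class PersistentStack:
    """An immutable stack built on a persistent singly linked list."""
    __slots__ = ("_head", "_tail")

    def __init__(self, head=None, tail=None):
        self._head = head   # the top item, or None for the empty stack
        self._tail = tail   # the rest of the stack, or None when empty

    def push(self, item):
        """Return a new stack with `item` on top; `self` is unchanged."""
        return PersistentStack(item, self)

    def pop(self):
        """Return the next version of the stack, *not* the popped item."""
        if self._tail is None:
            raise IndexError("pop from empty stack")
        return self._tail

    @property
    def top(self):
        """The item on top, made available since pop() cannot return it."""
        return self._head

# Every intermediate version remains usable after further pushes/pops.
s = PersistentStack().push(1).push(2)
```

Because `push` and `pop` only touch the head, both are O(1) and all versions share their tails.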
SortedList
The SortedList uses AVL tree algorithms to keep the tree in balance. I found it useful to create an IAvlNode interface. Two classes implement this interface, the AvlNode class and the NullAvlNode class. The NullAvlNode class implements the null object design pattern. This simplified many of the algorithms.
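To illustrate how a null object simplifies the tree code, here is a minimal Python sketch of the idea (the C# original uses an IAvlNode interface; the names and the tiny `search` method here are my own invention, and balancing is omitted). A single shared "null" node stands in for every empty subtree, so the algorithms never test for null references:

```python
class NullAvlNode:
    """Null-object terminator: one shared instance represents every
    empty subtree, so callers never need an explicit null check."""
    height = 0

    def search(self, key):
        return None  # reached the bottom without finding the key

NULL = NullAvlNode()

class AvlNode:
    """An immutable tree node; empty children default to NULL."""
    def __init__(self, key, value, left=NULL, right=NULL):
        self.key = key
        self.value = value
        self.left = left
        self.right = right
        # Height is computed uniformly -- NULL simply reports height 0.
        self.height = 1 + max(left.height, right.height)

    def search(self, key):
        if key < self.key:
            return self.left.search(key)
        if key > self.key:
            return self.right.search(key)
        return self.value

# A small tree; searches that fall off the tree land on NULL.
root = AvlNode(2, "b", AvlNode(1, "a"), AvlNode(3, "c"))
```

Without the null object, every height computation and every recursive call would need a special case for empty children.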
ArrayList
This is the class that proved most challenging. Like the SortedList, it uses a persistent AVL tree as its data structure. However, unlike the SortedList, items are accessed by index (or by position) rather than by key. I have to admit that the algorithms for accessing and inserting items in a binary tree by index weren't intuitive to me, so I turned to Knuth. Specifically, I used Algorithms B and C in section 6.2.3 in volume 3 of The Art of Computer Programming.
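The core idea behind access by index can be sketched independently of the AVL machinery. This is my own illustrative Python, not the article's code or Knuth's exact algorithm: each node stores its subtree size, and the search descends by comparing the index against the size of the left subtree instead of comparing keys:

```python
def size(node):
    """Size of a possibly-empty subtree."""
    return node.size if node is not None else 0

class SizedNode:
    """An immutable tree node augmented with its subtree size,
    which is what makes access by position possible."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        self.size = 1 + size(left) + size(right)

def get(node, index):
    """Find the item at `index` in left-to-right order."""
    left_size = size(node.left)
    if index < left_size:
        return get(node.left, index)            # it's in the left subtree
    if index == left_size:
        return node.value                       # this node is the item
    return get(node.right, index - left_size - 1)  # skip left + this node

# A balanced tree holding "a".."g" in index order 0..6.
tree = SizedNode("d",
                 SizedNode("b", SizedNode("a"), SizedNode("c")),
                 SizedNode("f", SizedNode("e"), SizedNode("g")))
```

Insertion by index works the same way: descend by comparing against left-subtree sizes, then rebuild (copy) the spine as with any persistent tree.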
I made an assumption about the ArrayList in order to improve performance. I assumed that the Add method is by far the most used method. However, adding items to the ArrayList one right after the other causes a lot of tree rotations to keep the tree in balance. To solve this, I created a template tree that is already completely balanced. Since this template tree is immutable, it can exist at the class level and be shared amongst all of the instances of the class.
When an instance of the ArrayList class is created, it takes a small subtree of the template tree. As items are added, the nodes in the template tree are replaced with new nodes. Since the tree is completely balanced, no rebalancing is necessary. If the subtree gets filled up, another subtree of equal height is taken from the template tree and joined to the existing tree. Insertions and deletions are handled normally, with rebalancing performed if necessary. Again, the assumption is that adding items to the ArrayList occurs much more frequently than inserting or deleting items.
Array
The Array class uses the random access list structure to provide a persistent array with logarithmic performance. Unlike a random access list, it has a fixed size.
http://programmers.stackexchange.com/questions/221762/why-doesnt-java-8-include-immutable-collections
Furthermore, immutable collections often have fundamentally different implementations than their mutable counterparts. Consider ArrayList, for example: an efficient immutable ArrayList wouldn't be an array at all! It should be implemented as a balanced tree with a large branching factor; Clojure uses 32, IIRC. Making mutable collections "immutable" by just adding a functional update is a performance bug just as much as a memory leak is.

IMMUTABLE LISTS
Instead of adding this new node to the end, we can add it to the beginning.
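The reason for preferring the beginning is worth spelling out with a sketch. In this illustrative Python (the `Cons`, `prepend`, and `append` names are mine), prepending creates one node that points at the existing list, while appending must copy every node, because immutable cells cannot have their `next` field changed in place:

```python
class Cons:
    """One immutable list cell."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def prepend(lst, value):
    """O(1): the new list is a single new node pointing at the old list."""
    return Cons(value, lst)

def append(lst, value):
    """O(n): every cell must be copied, since the last cell's `next`
    cannot be mutated to point at the new node."""
    if lst is None:
        return Cons(value, None)
    return Cons(lst.value, append(lst.next, value))

old = Cons(2, Cons(3))        # the list 2 -> 3
new = prepend(old, 1)         # 1 -> 2 -> 3, sharing all of `old`
appended = append(old, 4)     # 2 -> 3 -> 4, sharing nothing
```

This asymmetry is exactly why immutable lists grow at the head.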
PERSISTENT TRIES
The high branching factor reduces the time for operations on tries
In addition to the high branching factor of thirty-two or more, tries use a special property to organize the keys. The key for a node is determined by its path. In other words, we don’t store the key in the nodes; their location is their key.
An operation to change any element requires copying only up to four elements, making it close to a constant-time operation.
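Both properties, path-as-key and the small copy count, can be seen in a toy bit-partitioned trie. This is a minimal sketch in my own Python, not any particular library's implementation: children are addressed by 5-bit slices of the index (so 32-way branching), nodes are tuples to keep them immutable, and an update copies only the nodes along one path:

```python
BITS = 5            # 2**5 = 32-way branching
WIDTH = 1 << BITS

def empty(depth):
    """An empty trie of the given depth; identical empty children are
    shared, which immutability makes safe."""
    if depth == 0:
        return (None,) * WIDTH
    return (empty(depth - 1),) * WIDTH

def get(root, depth, index):
    """Descend by 5-bit slices of the index -- the index is never
    stored in a node, because a value's position *is* its key."""
    node = root
    for d in range(depth, -1, -1):
        node = node[(index >> (d * BITS)) & (WIDTH - 1)]
    return node

def update(root, depth, index, value):
    """Return a new trie with `index` set to `value`, copying only the
    depth+1 nodes on the path (a depth of 4 already addresses about a
    million elements, hence the near-constant-time feel)."""
    node = list(root)                       # copy this one small array
    slot = (index >> (depth * BITS)) & (WIDTH - 1)
    if depth == 0:
        node[slot] = value                  # leaf level: store the value
    else:
        node[slot] = update(root[slot], depth - 1, index, value)
    return tuple(node)                      # freeze the copy

# A two-level trie (32 * 32 = 1024 slots); set index 700 in a new version.
t0 = empty(1)
t1 = update(t0, 1, 700, "x")
```

Only the root and one child were copied to produce `t1`; the other 31 children are shared with `t0`, which stays fully intact.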