Tuesday, November 6, 2007

LaTeX

A very power formatter, you should not miss it.

Good tutorial on the website:

http://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/

Monday, November 5, 2007

Associative array: Hash Table vs BST

An associative array (also map, mapping, hash, dictionary, finite map, lookup table, and in query-processing an index or index file) is an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value. The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array. The relationship between a key and its value is sometimes called a mapping or binding.

Associative arrays are usually used when lookup is the most frequent operation. For this reason, implementations are usually designed to allow speedy lookup, at the expense of slower insertion and a larger storage footprint than other data structures.

Representations: Hash Table vs Binary Search Tree (self-balancing)

There are two main efficient data structures used to represent associative arrays, the hash table and the self-balancing binary search tree. Skip lists are also an alternative, though relatively new and not as widely used. Relative advantages and disadvantages include:

  • Hash tables have faster average lookup and insertion time (O(1)), while some kinds of binary search tree have faster worst-case lookup and insertion time (O(log n) instead of O(n)). Hash tables have seen extensive use in real time systems, but trees can be useful in high-security real time systems where untrusted users may deliberately supply information that triggers worst-case performance in a hash table, although careful design can remove that issue. Hash tables shine in very large arrays, where O(1) performance is important. Skip lists have worst-case operation time of O(n), but average-case of O(log n), with much less insertion and deletion overhead than balanced binary trees.
  • Hash tables can have more compact storage for small value types, especially when the values are bits.
  • There are simple persistent versions of balanced binary trees, which are especially prominent in functional languages.
  • Building a hash table requires a reasonable hash function for the key type, which can be difficult to write well, while balanced binary trees and skip lists only require a total ordering on the keys. On the other hand, with hash tables the data may be cyclically or partially ordered without any problems.
  • Balanced binary trees and skip lists preserve ordering allowing one to efficiently iterate over the keys in order or to efficiently locate an association whose key is nearest to a given value. Hash tables do not preserve ordering and therefore cannot perform these operations as efficiently.
  • Balanced binary trees can be easily adapted to efficiently assign a single value to a large ordered range of keys, or to count the number of keys in an ordered range.
http://en.wikipedia.org/wiki/Associative_array

Sunday, November 4, 2007

Splay Tree

A splay tree is a self-balancing binary search tree. A splay tree guarantees that any M consecutive tree operations starting from an empty tree take at most O(MlogN) time. The net effect is: there is no bad input sequences.

Good performance for a splay tree depends on the fact that it is self-balancing, and indeed self optimizing, in that frequently accessed nodes will move nearer to the root where they can be accessed more quickly. This is an advantage for nearly all practical applications, and is particularly useful for implementing caches; however it is important to note that for uniform access, a splay tree's performance will be considerably (although not asymptotically) worse than a somewhat balanced simple binary search tree.

The basic idea of the splay tree is that after a node is accessed, it is pushed to the root by a series of avl tree rotations. The intuition behind this: when a node is accessed, it is likely to be accessed again in the near future in many real applications. The side effect is a balancing tree to some extent.

The splay tree does NOT require the maintenance of height or balance information, thus saving space and simplifying the code to some extent. Splay trees are much simpler to program than AVL trees.

Assume X be a node on the access path at which we are rotating. X has both a parent (P) and a grandparent (G).
Splay Operations: 1) zig-zag case, where X is a right child and P is a left child (or vice versa). we perform a double rotation, exactly like an AVL double rotation. 2) zig-zig case. X and P are both left children. 3) zig-case. X's only parent is the root.

http://en.wikipedia.org/wiki/Splay_tree

Saturday, November 3, 2007

avl Tree: balanced binary search tree

An avl (Adelson-Velskii and Landis) tree is a binary search tree with a balance condition, the height of the left and right subtrees can differ by at most 1. The depth of the tree is O(logn).

After an insertion, only nodes that are on the path from the insertion point to the root might have their balance altered because only those nodes have their subtrees altered.

Possible violation might occur in four cases:

1. An insertion into the left subtree of the left child
2. ...................................right........................................
3. .................................... left ........................ right .....
4. .................................. right ....................... right .....

1. and 4. could be fixed by single rotate

2. and 3. could be fixed by double rotate

Rotation operations only need a few pointer changes. The node will need to record the height info within it.

Deletion in avl Tree is somewhat more complicated than insertion. And lazy-deletion is probably the best strategy.


Reference: Data structures and Algorithm Analysis in C++ Page 136

B-Trees

a B-tree is a tree data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic amortized time. It is most commonly used in databases and filesystems.

A B-tree is kept balanced by requiring that all leaf nodes are at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more hop further removed from the root.

B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 129-513 B-tree.

Search: Search is performed in the typical manner, analogous to that in a binary search tree. Starting at the root, the tree is traversed top to bottom, choosing the child pointer whose separation values are on either side of the value that is being searched. Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.

---------------------------------------
2-3 Tree

* Every non-leaf node has 2 or 3 children
* All data is kept in the leaves
* All leaves are at the same level (the bottom level)
* All data is kept in sorted order
* Every non-leaf node will contain 1 or 2 fields.

These contain one or two fields which indicate the range of values in its subtrees. If a node has two children, it will have one field; if the node has three children, it will have two fields. Each non-leaf node will contain a value in field 1 which is greater than the largest item in its left sub-tree, but less than or equal to the smallest item in its right sub-tree (or center sub-tree, if it has three children). If that node has three children, field 2 contains a value which is greater than the largest value in the center sub-tree, but less than or equal to the smallest item in its right sub-tree. The purpose of these values is to direct a search function to the correct sub-tree and eventually to the correct data node.

The biggest drawback with the 2-3 tree is that it requires more storage space than the normal binary search tree.
---------------------------------------
2-3-4 Trees

A 2-3-4 tree is a B-tree of order 4.

2-3-4 trees are relatively difficult to implement in most programming languages because of the large number of special cases involved in operations on the tree. Red-black trees (see below) are simpler to implement, so they tend to be used instead.
---------------------------------------

http://en.wikipedia.org/wiki/B-tree
http://en.wikipedia.org/wiki/2-3_tree

Red-Black Tree

A binary search tree is a red-black tree if:

1. Every node is either red or black
2. Every leaf (nil) is black
3. If a node is red, then both its children are black
4. Every simple path from a node to a descendant leaf contains the same number of black nodes

Black-height of a node x, bh(x), is the number of black nodes on any path from x to a leaf, not counting x. A red-black tree with n internal nodes has height at most 2lg(n+1).

Thursday, November 1, 2007

Suffix Array

Suffix array: an array of pointers to characters. The pointers in the array point to every suffix in the string, hence the name "suffix array". Suffix array can be used to efficiently search a large text.

Applications:

1. Find the longest duplicated substring of characters in it.
The duplicated substring mush appear in two different suffixes. We sort the suffix array to bring together equal suffix. and then scan through this array comparing adjacent elements to find the longest repeated string. The running time is O(nlogn) due to sorting.

1.1. Given a new input string, How would you search a suffix array to find the longest match in the stored text?
1.2. Given two input texts, find the longest string that occurs in both.

2. Locate every occurrence of a substring within the string.
Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(mlogn) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + logn) search time.

http://en.wikipedia.org/wiki/Suffix_array