Thursday, November 1, 2007

Suffix Array

Suffix array: an array of pointers to characters. The pointers in the array point to every suffix in the string, hence the name "suffix array". Suffix array can be used to efficiently search a large text.

Applications:

1. Find the longest duplicated substring of characters in it.
The duplicated substring mush appear in two different suffixes. We sort the suffix array to bring together equal suffix. and then scan through this array comparing adjacent elements to find the longest repeated string. The running time is O(nlogn) due to sorting.

1.1. Given a new input string, How would you search a suffix array to find the longest match in the stored text?
1.2. Given two input texts, find the longest string that occurs in both.

2. Locate every occurrence of a substring within the string.
Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(mlogn) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + logn) search time.

http://en.wikipedia.org/wiki/Suffix_array

No comments: