6.3 Suffix Arrays

This chapter under major construction.

Important note.

Beginning with Oracle and OpenJDK Java 7, Update 6, the substring() method takes linear time and space in the size of the extracted substring (instead of constant time and space). The String API provides no performance guarantees for any of its methods, including substring() and charAt().

The programs in the textbook and booksite have been updated to avoid any dependency on a constant-time substring operation. However, if you are using the third printing of the textbook (or earlier), consider yourself warned.

Suffix sorting and suffix arrays.

Suffix sorting: given a string, sort the suffixes of that string in ascending order. Resulting sorted list is called a suffix array. Program SuffixArray.java builds a sufix array data structure.

Longest repeated substring.

An application of sorting to computational biology and plagiarism detection. Program LongestRepeatedSubstring.java solves the problem using a suffix array.

Keyword in context (KWIC).

Given the suffix array, easy to search for a string or sentence via binary search. Memory is linear. Search is O(K log N) where K is the length of the string you are searching for. (Can be done in K + log N by using the lcp array.) Sometimes referred to as KWIK (key word in context) and collocations. Used by linguists. Concordancer = program to make concordance. A concordance of the Dead Sea Scrolls was famously reverse-engineered and turned into a readable text by Martin Abegg. Program KWIK.java.

Manber's algorithm.

However, the input size is N, so we might hope to exploit the overlapping nature of the substrings and do better. Program Manber.java sorts string suffixes using a version of Manber-Myers repeated doubling algorithm. Larsson gave an O(N log N) analysis when the internal sorting routine is replaced with an O(N log N) comparison based sorting algorithm.

Kasai's algorithm.

Here is a description of Kasai's algorithm for computing the lcp array from a suffix array.

Q+A

Linear time suffix sorting.

Skewed divide-and-conquer. The simplest linear time suffix sorting solution. Recursively sort the suffixes whose indices are {0, 1, 2} mod 3, and then merge.

Exercises

Give the suffix array for the string "abacadaba". Also, give a table of lcp(i) values for i = 1 to n-1.
Repeat the previous question for the string "mississippi".
How many bytes are used to create a SuffixArray object with a string of length n?
What's wrong with the following code fragment to compute all the suffixes for suffix sort?
suffix = ""; for (int i = s.length() - 1; i >= 0; i--) { suffix = s.charAt(i) + suffix; suffixes[i] = suffix; }
Answer: quadratic time and quadratic space.
What's wrong with the following code fragment to compute all the cyclic suffixes for suffix sort?
int n = s.length(); for (int i = 0; i < n; i++) suffixes[i] = s.substring(i, n) + s.substring(0, i);
Answer: quadratic time and quadratic space.

What's wrong with the following code fragment to compute all the cyclic suffixes for suffix sort?

int n = s.length;
StringBuilder ss = new StringBuilder();
ss.append(s).append(s);
for (int i = 0; i < N; i++)
    suffixes[i] = ss.substring(i, i + n);

Answer: quadratic time and quadratic space.

Longest common substring. Write a program LongestCommonSubstring.java that take two filenames as command-line arguments, reads the two text files, and find the longest substring that appears in both. Hint: create a suffix array for s#t where s and t are the two text strings and # is a character that does not appear in either.
In 1970, D. Knuth conjectured that is was impossible to solve this problem in linear time. In fact, it is possible to do it in linear time (in the worst case) using suffix trees or suffix arrays.
Burrows-Wheeler transform. The Burrows-Wheeler transform (BWT) is a transformation that is used in data compression algorithms, including bzip2 and in high-throughput sequencing in genomics. Given a text string of length N (terminated by a special end-of-file character $ that is smaller than any other character), Consider the N-by-N matrix in which each row contains a different cyclic rotation of the original text string. Sort the rows lexicographically. The Burrows-Wheeler transform is the rightmost column in the sorted matrix. For example, BWT (mississippi$) = ipssm$pissii.
Burrows-Wheeler inverse transform. The Burrows-Wheeler inverse transform (BWI) inverts the BWT. Given the BWT of a text string, design a linear-time algorithm to recover the original text string. For example, BWI(ipssm$pissii) = mississippi$.
Circular string linearization. Given a string s, find the rotation that is the smallest lexicographially. Problem arises in chemical databases for circular molecules. Each molecule is represented as a circular string. The canonical representation is the lexicographically smallest rotation. Devise an algorithm to compute this canonical representation of the circular string Hint: suffix sort.
Alternate solutions: Duval's algorithm using a Lyndon decomposition and the surprisingly elegant minimum expression algorithm by Zhou Yuan.
Accelerated rank. Use the following idea to speed up the binary search in the rank() method. Let lo and hi denote the left and right endpoints of the current search interval. Let lcpLo denote the lcp of the query string and suffixes[lo] and let lcpHi denote the lcp of the query string and suffixes[hi]. Then, when comparing the query string to suffixes[mid], only need to compare the characters starting at lcp = min(lcpLo, lcpHi) because all of the suffixes in the search interval have the same first lcp characters.
Accelerated rank analysis. Show that the the worst-case running time of the accelerated rank is still proportional to L log N, where L is the length of the query and N is the length of the text. However, Myers and Manber report that this speeds up the computation in practice. In theory, it can be improved to L + log N using non-adjacent lcp values.
Longest 3-repeated substring. Given a text string, find the longest substring that is repeated 3 or more times.
Longest k-repeated substring. Given a text string and an integer k, find the longest substring that is repeated k or more times.
Long repeated substring. Given a text string and an integer L, find all repeated substrings of length L or more.
Longest common substring among three strings. Given three strings r, s, and t, find the longest substring that appears in all three.
Longest common reverse-complemented substring. Given two DNA strings, find the longest substring that appears in one, and whose reverse Watson-Crick complement appears in the other. Two strings s and t are reverse complements if t is the reverse of s except with the following substitutions A<->T, C<->G. For example ATTTCGG and CCGAAAT are reverse complements of each other. Hint: suffix sort.
Suffix arrays with less memory. Instead of using an array of substrings where suffixes[i] referes to the ith sorted suffix, maintain an array of integers so that index[i] referes to the offset of the ith sorted suffix. To compare the substrings represented by a = index[i] and b = index[j], compare the character s.charAt(a) against s.charAt(b), s.charAt(a+1) against s.charAt(b+1), and so forth. How much memory do you save? Is your program faster?
k-gram frequency counts. Given a text string, design a data structure to efficiently answers queries of the form: how many times does a given k-gram appear? Should take time proportional to k log N in the worst case, where k is the length of the k-gram and N is the length of the text string.
Most frequent k-gram. Given a text string and an integer k, find the k-gram that appears most frequently.
Shortest unique substring. Given a text string, find a shortest substring that appears exactly once. The problem arises in bioinformatics.
Shortest non-substring. Given a bitstring, find a shortest bistring that does not appear as a substring.
Shortest unique substring. Given a text string, preprocess it to answer shortest unique substring queries of the following form: given an index q into the text string, find a shortest substring that contains index q and does not appear as a substring anywhere else in the text.