5.3 Substring Search

This section under major construction. Searching in long strings - online.

This website is a great resource for exact string searching algorithms.

High-performance pattern matching in Java for general string searching, searching with wildcards, and searching with character classes.

Program Brute.java is brute force string search. Essentially equivalent to SystemSearch.java.

Rabin-Karp.

Program RabinKarp.java implements the Rabin-Karp randomized fingerprint algorithm.

Knuth-Morris-Pratt.

Program KMP.java is Knuth-Morris-Pratt algorithm. KMPplus.java is an improved version that takes time and space proportional to M + N (independent of the alphabet size R).

Boyer-Moore.

Program BoyerMoore.java implements the bad-character rule part of the Boyer-Moore algorithm. It does not implement the strong good suffix rule.

Intrusion detection systems.

Need very fast string searching since these are deployed at choke points of networks. Application

Q+A

Exercises

Design a brute-force substring search algorithm that scans the pattern from right to left.
Show the trace of the brute-force algorithm in the style of figure XYZ for the following pattern and text strings.
- AAAAAAAB; AAAAAAAAAAAAAAAAAAAAAAAAB
- ABABABAB; ABABABABAABABABABAAAAAAAA
Determine the KMP DFA for the following pattern strings.
- AAAAAAAB
- AACAAAB
- ABABABAB
- ABAABAAABAAAB
- ABAABCABAABCB
Suppose that the pattern and text are random strings over an alphabet of size R >= 2. Show that the expected number of character compares is (N - M + 1) (1 - R^-M) / (1 - R^-1) <= 2 (N - M + 1).
Construct an example where the Boyer-Moore algorithm (with only the bad character rule) performls poorly.
How would you modify Rabin-Karp to search for a given pattern with the additional proviso that the middle character is a "wildcard" (any text character at all can match it).
How would you modify Rabin-Karp to determine whether any of a subset of k patterns (say, all of the same length) is in the text?
Solution. Compute the hashes of the k patterns and store the hashes in a SET.
How would you modify Rabin-Karp to search for an M-by-M pattern in an N-by-N text? Or some other irregularly-shaped pattern in an N-by-N text?
Monte Carlo vs. Las Vegas Rabin Karp.
Online palindrome detection. Read in characters one at a time. Report at each instant if the current string is a palindrome. Hint: use Karp-Rabin hashing idea.
Tandem repeats. A tandem repeat of a base string b within a string s is a substring of s consisting of at least one consecutive copy of the base string b. Given b and s, design an algorithm to find a tandem repeat of b within s of maximum length. The running time should be proportional to M + N, where M is the length of b and N is the length of s.
Solution. This problem is a generalization of substring search (is there at least one consecutive copy of b within s?) so we need an algorithm that generalize substring search. Create the Knuth-Morris-Pratt DFA for k concatenated copies of b, where k = n/m. Now, simulate DFA on input s and record the largest state that it reaches. From this, we can identify the longest tandem repeat.
Suffix-prefix match. Design a linear-time algorithm to find the longest suffix of one string a that exactly matches a prefix of another string b.
Cyclic rotation. Design a linear-time algorithm to determine whether one string is a cyclic rotation of another. A string a is a cyclic rotation of a string b if a and b have the same length and a consists of a suffix of b followed by a prefix of b.
Substring of a circular string. Design a linear-time algorithm to determine whether one string a is a substring of a cirular string b.
Longest palindromic substring. Given a string s, find the longest substring that is a palindrome (or a Watson-crick palindrome). Solution: can be solved in linear time using suffix trees or Manacher's algorithm. Here's a simpler solution that typically runs in linearthmic time. First, we describe how to find all palindromic substrings of length exactly L in linear time: use Karp-Rabin to iteratively form the hashes of each substring of length L (and its reverse), and compare. Since you don't know L, repeatedly double your guess of L until you know the optimal length is between L and 2L. Then use binary search to find the exactly length.
Solution. Manacher.java is an implementation of Manacher's algorithm.
Repeated substring. [ Mihai Patrascu] Given an integer K and a string of length N, find the longest substring which appears at least K times.
One solution. Assume you know the length L of the repeated string. Hash each substring of length L, and check if any hash occurs K or more times. If so, check to make sure you didn't get unlucky. Since you don't know L, repeatedly double your guess of L until you know the optimal length is between L and 2L. Then use binary search to find the right value.
Longest common substring. Given two (or three strings), find the longest substring that appears in all three. Hint: assume you know the length L of the longest common substring. Hash each substring of length L and check if any hash bucket contains (at least) one entry from each string.
All matches. Modify KMP to find all matches in linear time (instead of leftmost match).
Fibonacci string. Interesting case for KMP. F(1) = B, F(2) = A, F(3) = AB, F(4) = ABA, F(5) = ABAAB, F(N) = F(N-1) F(N-2).
Suppose that x and y are two strings. Design a linear-time algorithm to determine whether x^m = y^n for some integers m and n (where x^m means the concatenation of m copies of x).
Solution. Suffices to check where xy = yx (this fact is nontrivial - it follows from the Lyndon-Schutzenberger theorem).
Period of a string. Let s be a nonempty string. An integer p such that s[i] = s[i+p] for all i = 0, 1, ..., N-p-1. is called a period of s The period of string s is the smallest integer p that is a period of s (can be N). For example, the period of ABCABCABCABCAB is 3. Design a linear-time algorithm to compute the period of a string.
Border of a string. Given a nonempty string s, we define string w to be a border of s if s = yw = wz for some strings y, z, and w with |y| = |z| = p, i.e., w is a proper substring of s that is both a prefix and a suffix of s. The border of a string is the longest proper border of s (can be empty). For example, the boder of ABCABCABCABCAB is w = ABCABCAB (with y = ABC, z = CAB, and p = 3). Design a linear-time algorithm to compute the border of a string.
Anagram substring search. Given a text string txt[] of length N and a pattern string pat[] of length M, determine whether pat[] or any of its anagrams (any of its M! permutations) appears in the text.
Hint: maintain a histogram of the letter frequencies for a given substring of length M in the text.