5.3 Substring Search
This section under major construction.
Searching in long strings - online.
This website is a great resource for exact string searching algorithms.
High-performance pattern matching in Java for general string searching, searching with wildcards, and searching with character classes.
Program Brute.java is
brute force string search.
Essentially equivalent to
SystemSearch.java.
Rabin-Karp.
Program RabinKarp.java implements the Rabin-Karp randomized fingerprint algorithm.Knuth-Morris-Pratt.
Program KMP.java is Knuth-Morris-Pratt algorithm. KMPplus.java is an improved version that takes time and space proportional to M + N (independent of the alphabet size R).Boyer-Moore.
Program BoyerMoore.java implements the bad-character rule part of the Boyer-Moore algorithm. It does not implement the strong good suffix rule.Intrusion detection systems.
Need very fast string searching since these are deployed at choke points of networks. Application
Q+A
Exercises
- Design a brute-force substring search algorithm that scans the pattern from right to left.
- Show the trace of the brute-force algorithm in the style of figure XYZ
for the following pattern and text strings.
- AAAAAAAB; AAAAAAAAAAAAAAAAAAAAAAAAB
- ABABABAB; ABABABABAABABABABAAAAAAAA
- Determine the KMP DFA for the following pattern strings.
- AAAAAAAB
- AACAAAB
- ABABABAB
- ABAABAAABAAAB
- ABAABCABAABCB
- Suppose that the pattern and text are random strings over an alphabet of size R >= 2. Show that the expected number of character compares is (N - M + 1) (1 - R^-M) / (1 - R^-1) <= 2 (N - M + 1).
- Construct an example where the Boyer-Moore algorithm (with only the bad character rule) performls poorly.
- How would you modify Rabin-Karp to search for a given pattern with the additional proviso that the middle character is a "wildcard" (any text character at all can match it).
- How would you modify Rabin-Karp to determine whether any of a subset
of k patterns (say, all of the same length) is in the text?
Solution. Compute the hashes of the k patterns and store the hashes in a SET.
- How would you modify Rabin-Karp to search for an M-by-M pattern in an N-by-N text? Or some other irregularly-shaped pattern in an N-by-N text?
- Monte Carlo vs. Las Vegas Rabin Karp.
- Online palindrome detection. Read in characters one at a time. Report at each instant if the current string is a palindrome. Hint: use Karp-Rabin hashing idea.
- Tandem repeats.
A tandem repeat of a base string b within a string s is a substring
of s consisting of at least one consecutive copy of the base
string b. Given b and s, design an algorithm to
find a tandem repeat of b within s of maximum length.
The running time should be proportional to M + N, where
M is the length of b and N is the length of s.
Solution. This problem is a generalization of substring search (is there at least one consecutive copy of b within s?) so we need an algorithm that generalize substring search. Create the Knuth-Morris-Pratt DFA for k concatenated copies of b, where k = n/m. Now, simulate DFA on input s and record the largest state that it reaches. From this, we can identify the longest tandem repeat.
- Suffix-prefix match. Design a linear-time algorithm to find the longest suffix of one string a that exactly matches a prefix of another string b.
- Cyclic rotation. Design a linear-time algorithm to determine whether one string is a cyclic rotation of another. A string a is a cyclic rotation of a string b if a and b have the same length and a consists of a suffix of b followed by a prefix of b.
- Substring of a circular string. Design a linear-time algorithm to determine whether one string a is a substring of a cirular string b.
- Longest palindromic substring.
Given a string s, find the longest substring that is a palindrome
(or a Watson-crick palindrome).
Solution: can be solved in linear time using suffix trees
or Manacher's algorithm.
Here's a simpler solution that typically runs in linearthmic time.
First, we describe how to find all palindromic substrings
of length exactly L in linear time: use Karp-Rabin to iteratively
form the hashes of each substring of length L (and its reverse),
and compare.
Since you don't know L, repeatedly double your guess of L until
you know the optimal length is between L and 2L. Then use
binary search to find the exactly length.
Solution. Manacher.java is an implementation of Manacher's algorithm.
- Repeated substring. [
Mihai Patrascu]
Given an integer K and a string of length N, find the longest substring which
appears at least K times.
One solution. Assume you know the length L of the repeated string. Hash each substring of length L, and check if any hash occurs K or more times. If so, check to make sure you didn't get unlucky. Since you don't know L, repeatedly double your guess of L until you know the optimal length is between L and 2L. Then use binary search to find the right value.
- Longest common substring. Given two (or three strings), find the longest substring that appears in all three. Hint: assume you know the length L of the longest common substring. Hash each substring of length L and check if any hash bucket contains (at least) one entry from each string.
- All matches. Modify KMP to find all matches in linear time (instead of leftmost match).
- Fibonacci string. Interesting case for KMP. F(1) = B, F(2) = A, F(3) = AB, F(4) = ABA, F(5) = ABAAB, F(N) = F(N-1) F(N-2).
- Suppose that x and y are two strings.
Design a linear-time algorithm to determine whether x^m = y^n for some
integers m and n (where x^m means the concatenation of m copies of x).
Solution. Suffices to check where xy = yx (this fact is nontrivial - it follows from the Lyndon-Schutzenberger theorem).
- Period of a string. Let s be a nonempty string. An integer p such that s[i] = s[i+p] for all i = 0, 1, ..., N-p-1. is called a period of s The period of string s is the smallest integer p that is a period of s (can be N). For example, the period of ABCABCABCABCAB is 3. Design a linear-time algorithm to compute the period of a string.
- Border of a string. Given a nonempty string s, we define string w to be a border of s if s = yw = wz for some strings y, z, and w with |y| = |z| = p, i.e., w is a proper substring of s that is both a prefix and a suffix of s. The border of a string is the longest proper border of s (can be empty). For example, the boder of ABCABCABCABCAB is w = ABCABCAB (with y = ABC, z = CAB, and p = 3). Design a linear-time algorithm to compute the border of a string.
- Anagram substring search.
Given a text string txt[] of length N and a pattern string pat[] of length M,
determine whether pat[] or any of its anagrams (any of its M! permutations)
appears in the text.
Hint: maintain a histogram of the letter frequencies for a given substring of length M in the text.