# 5.3 Substring Search

This section under major construction.
Searching in long strings - online.

This website is a great resource for exact string searching algorithms.

High-performance pattern matching in Java for general string searching, searching with wildcards, and searching with character classes.

Program Brute.java is
brute force string search.
Essentially equivalent to
SystemSearch.java.

## Rabin-Karp.

Program RabinKarp.java implements the Rabin-Karp randomized fingerprint algorithm.## Knuth-Morris-Pratt.

Program KMP.java is Knuth-Morris-Pratt algorithm. KMPplus.java is an improved version that takes time and space proportional to M + N (independent of the alphabet size R).## Boyer-Moore.

Program BoyerMoore.java implements the bad-character rule part of the Boyer-Moore algorithm. It does not implement the strong good suffix rule.### Intrusion detection systems.

Need very fast string searching since these are deployed at choke points of networks. Application

#### Q+A

#### Exercises

- Design a brute-force substring search algorithm that scans the pattern from right to left.
- Show the trace of the brute-force algorithm in the style of figure XYZ
for the following pattern and text strings.
- AAAAAAAB; AAAAAAAAAAAAAAAAAAAAAAAAB
- ABABABAB; ABABABABAABABABABAAAAAAAA

- Determine the KMP DFA for the following pattern strings.
- AAAAAAAB
- AACAAAB
- ABABABAB
- ABAABAAABAAAB
- ABAABCABAABCB

- Suppose that the pattern and text are random strings over an alphabet of size R >= 2. Show that the expected number of character compares is (N - M + 1) (1 - R^-M) / (1 - R^-1) <= 2 (N - M + 1).
- Construct an example where the Boyer-Moore algorithm (with only the bad character rule) performls poorly.
- How would you modify Rabin-Karp to search for a given pattern with the additional proviso that the middle character is a "wildcard" (any text character at all can match it).
- How would you modify Rabin-Karp to determine whether any of a subset
of k patterns (say, all of the same length) is in the text?
*Solution.*Compute the hashes of the k patterns and store the hashes in a SET. - How would you modify Rabin-Karp to search for an M-by-M pattern in an N-by-N text? Or some other irregularly-shaped pattern in an N-by-N text?
- Monte Carlo vs. Las Vegas Rabin Karp.
**Online palindrome detection.**Read in characters one at a time. Report at each instant if the current string is a palindrome.*Hint*: use Karp-Rabin hashing idea.**Tandem repeats.**A tandem repeat of a base string b within a string s is a substring of s consisting of at least one consecutive copy of the base string b. Given b and s, design an algorithm to find a tandem repeat of b within s of maximum length. The running time should be proportional to M + N, where M is the length of b and N is the length of s.*Solution.*This problem is a generalization of substring search (is there at least one consecutive copy of b within s?) so we need an algorithm that generalize substring search. Create the Knuth-Morris-Pratt DFA for k concatenated copies of b, where k = n/m. Now, simulate DFA on input s and record the largest state that it reaches. From this, we can identify the longest tandem repeat.**Suffix-prefix match.**Design a linear-time algorithm to find the longest suffix of one string*a*that exactly matches a prefix of another string*b*.**Cyclic rotation.**Design a linear-time algorithm to determine whether one string is a cyclic rotation of another. A string*a*is a cyclic rotation of a string*b*if*a*and*b*have the same length and*a*consists of a suffix of*b*followed by a prefix of*b*.**Substring of a circular string.**Design a linear-time algorithm to determine whether one string*a*is a substring of a cirular string*b*.**Longest palindromic substring.**Given a string s, find the longest substring that is a palindrome (or a Watson-crick palindrome).*Solution*: can be solved in linear time using suffix trees or Manacher's algorithm. Here's a simpler solution that typically runs in linearthmic time. First, we describe how to find all palindromic substrings of length exactly L in linear time: use Karp-Rabin to iteratively form the hashes of each substring of length L (and its reverse), and compare. Since you don't know L, repeatedly double your guess of L until you know the optimal length is between L and 2L. Then use binary search to find the exactly length.*Solution.*Manacher.java is an implementation of Manacher's algorithm.**Repeated substring.**[ Mihai Patrascu] Given an integer K and a string of length N, find the longest substring which appears at least K times.*One solution.*Assume you know the length L of the repeated string. Hash each substring of length L, and check if any hash occurs K or more times. If so, check to make sure you didn't get unlucky. Since you don't know L, repeatedly double your guess of L until you know the optimal length is between L and 2L. Then use binary search to find the right value.**Longest common substring.**Given two (or three strings), find the longest substring that appears in all three.*Hint*: assume you know the length L of the longest common substring. Hash each substring of length L and check if any hash bucket contains (at least) one entry from each string.**All matches.**Modify KMP to find all matches in linear time (instead of leftmost match).**Fibonacci string.**Interesting case for KMP. F(1) = B, F(2) = A, F(3) = AB, F(4) = ABA, F(5) = ABAAB, F(N) = F(N-1) F(N-2).- Suppose that x and y are two strings.
Design a linear-time algorithm to determine whether x^m = y^n for some
integers m and n (where x^m means the concatenation of m copies of x).
*Solution.*Suffices to check where xy = yx (this fact is nontrivial - it follows from the Lyndon-Schutzenberger theorem). **Period of a string.**Let s be a nonempty string. An integer p such that s[i] = s[i+p] for all i = 0, 1, ..., N-p-1. is called a*period*of s The period of string s is the smallest integer p that is a period of s (can be N). For example, the period of ABCABCABCABCAB is 3. Design a linear-time algorithm to compute the period of a string.**Border of a string.**Given a nonempty string s, we define string w to be a*border*of s if s = yw = wz for some strings y, z, and w with |y| = |z| = p, i.e., w is a proper substring of s that is both a prefix and a suffix of s. The border of a string is the longest proper border of s (can be empty). For example, the boder of ABCABCABCABCAB is w = ABCABCAB (with y = ABC, z = CAB, and p = 3). Design a linear-time algorithm to compute the border of a string.**Anagram substring search.**Given a text string txt[] of length N and a pattern string pat[] of length M, determine whether pat[] or any of its anagrams (any of its M! permutations) appears in the text.*Hint*: maintain a histogram of the letter frequencies for a given substring of length M in the text.