5.4 Regular Expressions

This section under major construction.

Regular expressions.

NFA.java, DFS.java, Digraph.java, and GREP.java.

Running time.

M = length of expression, N = length of input. Regular expression matching algorithm can create NFA in O(M) time and simulate input in O(MN) time.

Library implementations.

Validate.java.

Most library implementations of regular expressions use a backtracking algorithm that can take an exponential amount of time on some inputs. Such inputs can be surprisingly simple. For example, to determine whether a string of length N is matched by the regular expression (a|aa)*b can take an amount of time exponential in N if the string is chosen carefully. The table below illustrates just how spectacularly that the Java 1.4.2 regular expression can fail.

java Validate "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaac             1.6 seconds
java Validate "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac           3.7 seconds
java Validate "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac         9.7 seconds
java Validate "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac      23.2 seconds
java Validate "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac    62.2 seconds
java Validate "(a|aa)*b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 161.6 seconds

java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaac           1.28
java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaaac          2.45
java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaaaac         4.54
java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaaaaac        8.84
java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaaaaaac      17.74
java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaaaaaaac     33.77
java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaaaaaaaac    67.72
java Validate "(a*)*|b*" aaaaaaaaaaaaaaaaaaaaaaaaaaaac  134.73

The above examples are artificial, but they illustrate an alarming defect in most regular expression libraries. Bad inputs do occur in practice. According to Crosby and Wallach, the following regular expression appears in a version of SpamAssassin, a powerful spam filtering program.

[a-z]+@[a-z]+([a-z\.]+\.)+[a-z]+

It attempts to match certain email addresses, but it takes an exponential time to match some strings in many regular expression libraries including Sun's Java 1.4.2.

java Validate "[a-z]+@[a-z]+([a-z\.]+\.)+[a-z]+" spammer@x......................

This is especially significant because a spammer could use a pathological return email addresses to denial-of-service attack a mail server that has SpamAssassin running. This particular pattern is now fixed because Perl 5 regular expresssions use an internal cache to short-circuit repeated matches at the same location during backtracking.

These deficiencies are not limited to Java's implementation. For example, GNU regex-0.12 takes exponential time for matching strings of the form aaaaaaaaaaaaaac with the regular expression (a*)*|b*. Sun's Java 1.4.2 is equally susceptible to this one. Moreover, Java and Perl regular expressions support back references - the regular expression pattern matching problem for such extended regular expressions is NP hard, so this exponential blowup appears to be inherent on some inputs.

Here's one I actually wrote to try to find the last word before the string NYSE: regexp = "([\\w\\s]+).*NYSE";

Reference: Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). Compares Thompson NFA with backtracking approach. Contains some performance optimizations for Thompson NFA. Also some historical notes and references.

Q + A

Q. Documentation on Java regular expression libraries?

A. Here is Oracle's guide to using regular expressions. It includes many more operations that we will not explore. Also see the String methods matches(), split(), and replaceAll(). These are shorthands for using the Pattern and Matcher classes. Here's some common regular expression patterns.

Q. Industrial grade regular expressions for email addresses, Java identifiers, integers, decimal, etc.?

A. Here's a library of useful regular expressions that offers industrial grade patterns for email addresses, URLs, numbers, dates, and times. Try this regular expression tool.

Q. I'm confused why does (a | b)* match all strings of a's and b's, instead of only string with all a's or string with all b's?

A. The * operator replicates the regular expression (and not a fixed string that matches the regular expression). So the above is equivalent to ε | (a|b) | (a|b)(a|b) | (a|b)(a|b)(a|b) | ....

Q. History?

A. In the 1940s, Warren McCulloch and Walter Pitts modeled neurons as finite automata to describe the nervous system. In 1956, Steve Kleene invented a mathematical abstraction called regular sets to describe these models. Representation of events in nerve nets and finite automata. in Automata Studies, 3-42, Princeton University Press, Princeton, New Jersey 1956.

Q. Any tools for visualizing regular expressions?

A. Try Debuggerx.

Exercises

Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
1. 0 or 11 or 101
2. only 0s
Answers: 0 | 11 | 101, 0*
Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
1. all binary strings
2. all binary strings except empty string
3. begins with 1, ends with 1
4. ends with 00
5. contains at least three 1s
Answers: (0|1)*, (0|1)(0|1)*, 1 | 1(0|1)*1, (0|1)*00, (0|1)*1(0|1)*1(0|1)*1(0|1)* or 0*10*10*1(0|1)*.
Write a regular expression to describe inputs over the alphabet {a, b, c} that are in sorted order. Answer: a*b*c*.
Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
1. contains at least three consecutive 1s
2. contains the substring 110
3. contains the substring 1101100
4. doesn't contain the substring 110
Answers: (0|1)*111(0|1)*, (0|1)*110(0|1)*, (0|1)*1101100(0|1)*, (0|10)*1*. The last one is by far the trickiest.
Write a regular expression for binary strings with at least two 0s but not consecutive 0s.
Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
1. has at least 3 characters, and the third character is 0
2. number of 0s is a multiple of 3
3. starts and ends with the same character
4. odd length
5. starts with 0 and has odd length, or starts with 1 and has even length
6. length is at least 1 and at most 3
Answers: (0|1)(0|1)0(0|1)*, 1* | (1*01*01*01*)*, 1(0|1)*1 | 0(0|1)*0 | 0 | 1, (0|1)((0|1)(0|1))*, 0((0|1)(0|1))* | 1(0|1)((0|1)(0|1))*, (0|1) | (0|1)(0|1) | (0|1)(0|1)(0|1).
For each of the following, indicate how many bit strings of length exactly 1000 are matched by the regular expression: 0(0 | 1)*1, 0*101*, (1 | 01)*.
Write a regular expression that matches all strings over the alphabet {a, b, c} that contain:
1. starts and ends with a
2. at most one a
3. at least two a's
4. an even number of a's
5. number of a's plus number of b's is even
Find long words whose letters are in alphabetical order, e.g., almost and beefily. Answer: use the regular expression '^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$'.
Write a Java regular expression to match phone numbers, with or without area codes. The area codes should be of the form (609) 555-1234 or 555-1234.
Find all English words that end with nym.
Final all English words that contain the trigraph bze. Answer: subzero.
Find all English words that start with g, contain the trigraph pev and end with e. Answer: grapevine.
Find all English words that contain the trigraph spb and have at least two r's.
Find the longest English word that can be written with the top row of a standard keyboard. Answer: proprietorier.
Find all words that contain the four letters a, s, d, and f, not necessarily in that order. Solution: cat words.txt | grep a | grep s | grep d | grep f.
Given a string of A, C, T, and G, and X, find a string where X matches any single character, e.g., CATGG is contained in ACTGGGXXAXGGTTT.
Write a Java regular expression, for use with Validate.java, that validates Social Security numbers of the form 123-45-6789. Hint: use \d to represent any digit. Answer: [0-9]{3}-[0-9]{2}-[0-9]{4}.
Modify the previous exercise to make the - optional, so that 123456789 is considered a legal input.
Write a Java regular expression to match all strings that contain exactly five vowels and the vowels are in alphabetical order. Answer: [^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*
Write a Java regular expression to match valid Windows XP file names. Such a file name consists of any sequence of characters other than
/ \ : * ? " < > |
Additionally, it cannot begin with a space or period.
Write a Java regular expression to match valid OS X file names. Such a file name consists of any sequence of characters other than a colon. Additionally, it cannot begin with a period.
Given a string s that represents the name of an IP address in dotted quad notation, break it up into its constituent pieces, e.g., 255.125.33.222. Make sure that the four fields are numeric.
Write a Java regular expression to describe all dates of the form Month DD, YYYY where Month consists of any string of upper or lower case letters, the date is 1 or 2 digits, and the year is exactly 4 digits. The comma and spaces are required.
Write a Java regular expression to describe valid IP addresses of the form a.b.c.d where each letter can represent 1, 2, or 3 digits, and the periods are required. Yes: 196.26.155.241.
Write a Java regular expression to match license plates that start with 4 digits and end with two uppercase letters.
Write a regular expression to extract the coding sequence from a DNA string. It starts with the ATG codon and ends with a stop codon (TAA, TAG, or TGA). reference
Write a regular expression to check for the sequence rGATCy: that is, does it start with A or G, then GATC, and then T or C.
Write a regular expression to check whether a sequence contains two or more repeats of the the GATA tetranucleotide.
Modify Validate.java to make the searches case insensitive. Hint: use the (?i) embedded flag.
Write a Java regular expression to match various spellings of Libyan dictator Moammar Gadhafi's last name using the folling template: (i) starts with K, G, Q, (ii) optionally followed by H, (iii) followed by AD, (iv) optionally followed by D, (v) optionally followed by H, (vi) optionally followed by AF, (vii) optionally followed by F, (vii) ends with I.
Write a Java program that reads in an expression like (K|G|Q)[H]AD[D][H]AF[F]I and prints out all matching strings. Here the notation [x] means 0 or 1 copy of the letter x.
Why doesn't s.replaceAll("A", "B"); replace all occurrences of the letter A with B in the string s?
Answer: Use s = s.replaceAll("A", "B"); instead The method replaceAll returns the resulting string, but does not change s itself. Strings are immutable.
Write a program Clean.java that reads in text from standard input and prints it back out, removing any trailing whitespace on a line and replacing all tabs with 4 spaces.
Hint: use replaceAll() and the regular expression \s for whitespace.
Write a regular expression to match all of the text between the text a href =" and the next ". Answer: href=\"(.*?)\". The ? makes the .* reluctant instead of greedy. In Java, use Pattern.compile("href=\\\"(.*?)\\\"", Pattern.CASE_INSENSITIVE) to escape the backslash characters.
Use regular expressions to extract all of the text between the tags <title> and <\title>. The (?i) is another way to make the match case insensitive. The $2 refers to the second captured subsequence, i.e., the stuff between the title tags.
String pattern = "(?i)(<title.*?>)(.+?)(</title>)"; String updated = s.replaceAll(pattern, "$2");
Write a regular expression to match all of the text between <TD ...> and </TD> tags. Answer: <TD[^>]*>([^<]*)</TD>

Creative Exercises

FMR-1 triplet repeat region. "The human FMR-1 gene sequence contains a triplet repeat region in which the sequence CGG or AGG is repeated a number of times. The number of triplets is highly variable between individuals, and increased copy number is associated with fragile X syndrome, a genetic disease that causes intellectual disability and other symptoms in one out of 2000 children." (Reference: Biological Sequence Analysis by Durbin et al). The pattern is bracket by GCG and CTG, so we get the regular expression GCG (CGG | AGG)* CTG.
Ad blocking. Adblock uses regular expressions to block banner adds under the Mozilla and Firebird browsers.
Parsing text files. A more advanced example where we want to extract specific pieces of the matching input. This program typifies the process of parsing scientific input data.
PROSITE to Java regular expression. Write a program to read in a PROSITE pattern and print out the corresponding Java regular expression. PROSITE is the "first and most famous" database of protein families and domains. Its main use it to determine the function of uncharacterized proteins translated from genomic sequences. Biologists use PROSITE pattern syntax rules to search for patterns in biological data. Here is the raw data for CBD FUNGAL (accession code PS00562). Each line contains various information. Perhaps the most interesting line is the one that begins with PA - it contains the pattern that describes the protein motif. Such patterns are useful because they often correspond to functional or structural features.
PA C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C.
Each uppercase letter corresponds to one amino acid residue. The alphabet consists of uppercase letters corresponding to the 2x amino acids. The - character means concatenation. For example, the pattern above begins with CGG (Cys-Gly-Gly). The notation x plays the role of a wildcard - it matches any amino acid. This corresponds to . in our notation. Parentheses are used to specify repeats: x(2) means exactly two amino acids, and x(4,7) means between 4 and 7 amino acids. This corresponds to .{2} and .{4,7} in Java notation. Curly braces are used to specify forbidden residues: {CG} means any residue other than C or G. The asterisk has its usual meaning.
Text to speech synthesis. Original motivation for grep. "For example, how do you cope with the digraph ui, which is pronounced many different ways: fruit, guile, guilty, anguish, intuit, beguine?"
Challenging regular expressions. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
1. any string except 11 or 111
2. every odd symbol is a 1
3. contains at least two 0s and at most one 1
4. no consecutive 1s
Binary divisibility. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
1. bit string interpreted as binary number is divisible by 3
2. bit string interpreted as binary number is divisible by 123
Boston accent. Write a program to replace all of the r's with h's to translate a sentence like "Park the car in Harvard yard" into the Bostonian version "Pahk the cah in Hahvahd yahd".
File extension. Write a program that takes the name of a file as a command line argument and prints out its file type extension. The extension is the sequence of characters following the last .. For example the file sun.gif has the extension gif. Hint: use split("\\."); recall that . is a regular expression meta-character, so you need to escape it.
Reverse subdomains. For web log analysis, it is convenient to organize web traffic based on subdomains like wayne.faculty.cs.princeton.edu. Write a program to read in a domain name and print it out in reverse order like edu.princeton.cs.faculty.wayne.
Bank robbery. You just witnessed a bank robbery and got a partial license plate of the getaway vehicle. It started with ZD, had a 3 somewhere in the middle and ended with V. Help the police officer write regular expression for this plate.
Regular expression for permutations. Find the shortest regular expression (using only the basic operations) you can for the set of all permutations on N elements for N = 5 or 10. For example if N = 3, then the language is abc, acb, bac, bca, cab, cba. Answer: difficult. Solution has length exponential in N.
Parsing quoted strings. Read in a text file and print out all quote strings. Use a regular expression like "[^"]*", but need to worry about escaping the quotation marks.
Parsing HTML. A >, optionally followed by whitespace, followed by a, followed by whitespace, followed by href, optionally followed by whitespace, followed by =, optionally followed by whitespace, followed by "http://, followed by characters until ", optionally followed by whitespace, then a <.
< \s* a \s+ href \s* = \s* \\"http://[^\\"]* \\" \s* >
Subsequence. Given a string s, determine whether it is a subsequence of another string t. For example, abc is a subsequence of achfdbaabgabcaabg. Use a regular expression. Now repeat the process without using regular expressions. Answer: (a) a.*b.*c.*, (b) use a greedy algorithm.
Huntington's disease diagnostic. The gene that causes Huntington's disease is located on chromosome 4, and has a variable number of repeats of the CAG trinucleotide repeat. Write a program to determine the number of repeats and print will not develop HD If the number of repeats is less than 26, offspring at risk if the number is 37-35, at risk if the number is between 36 and 39, and will develop HD if the number is greater than or equal to 40. This is how Huntington's disease is identified in genetic testing.
Gene finder. A gene is a substring of a genome that starts with the start codon (ATG), end with a stop codon (TAG, TAA, TAG, or TGA) and consists of a sequence of codons (nucleotide triplets) other than the start or stop codons. The gene is the substring in between the start and stop codons.
Repeat finder. Write a program Repeat.java that takes two command line arguments, and finds the maximum number of repeats of the first command line argument in the file specified by the second command line argument.
Character filter. Given a string t of bad characters, e.g. t = "!@#$%^&*()-_=+", write a function to read in another string s and return the result of removing all of the bad characters.
String pattern = "[" + t + "]"; String result = s.replaceAll(pattern, "");
Wildcard pattern matcher. Without using Java's built in regular expressions, write a program Wildcard.java to find all words in the dictionary matching a given pattern. The special symbol * matches any zero or more characters. So, for example the pattern "w*ard" matches the word "ward" and "wildcard". The special symbol . matches any one character. Your program should read the pattern as a command line parameter and the list of words (separated by whitespace) from standard input.
Wildcard pattern matcher. Repeat the previous exercise, but this time use Java's built in regular expressions. Warning: in the context of wildcards, * has a different meaning than with regular expressions.
Search and replace. Word processors allow you to search for all occurrences of a given query string and replace each with another replacement string. Write a program SearchAndReplace.java that takes two strings as command line inputs, reads in data from standard input, and replaces all occurrences of the first string with the second string, and sends the results to standard output. Hint: use the method String.replaceAll.
Password validator. Suppose that for security reasons you require all passwords to have at least one of the following characters
~ ! @ # $ % ^ & * |
Write a regular expression for use with String.matches that returns true if and only if the password contains one of the required characters. Answer: "^[^~!@#$%^&*|]+$"
Alphanumeric filter. Write a program Filter.java to read in text from standard input and eliminate all characters that are not whitespace or alpha-numeric. Answer here's the key line.
String output = input.replaceAll("[^\\s0-9a-zA-Z]", "");
Converting tabs to spaces. Write a program to convert all tabs in a Java source file to 4 spaces.
Parsing delimited text files. A popular way to store a database is in a text file with one record per line, and each field separated by a special character called the delimiter.
19072/Narberth/PA/Pennsylvania 08540/Princeton/NJ/New Jersey
Write a program Tokenizer.java that reads in two command line parameters, a delimiter character and the name of the file, and creates an array of tokens.
Parsing delimited text files. Repeat the previous exercise, but use the String library method split().
Checking a file format.
Misspellings. Write a Java program to verify that this list of common misspellings adapted from Wikipedia contains only lines of the form
misdemenors (misdemeanors) mispelling (misspelling) tennisplayer (tennis player)
where the first word is the misspelling and the string in parentheses is a possible replacement.
interesting English words
Size of DFA is exponential in size of RE. Give a RE for the set of all bitstrings whose kth to the last character equals 1. The size of the RE should be linear in k. Now, give a DFA for the same set of bitstrings. How many states does it use?
Hint: every DFA for this set of bitstrings must have at least 2^k states.