5.2 Tries
This section under major construction.
Symbol tables with string keys.
Could use standard symbol table implementation. Instead, exploit additional structure of string keys. Customized searching algorithms for strings (and other keys represented as digits). Goal: as fast as hashing, more flexible than binary search trees. Can efficiently support additional operations including prefix and wildcard matching, e.g., IP routing table wants to forward to 128.112.136.12, instead forwards to 128.112 which is longest matching prefix that it knows about. Side benefit: fast and space-efficient string searching.R-way tries. Program TrieST.java implements a string symbol table using a multiway trie.
Ternary search tries. Program TST.java implements a string symbol table using a ternary search trie.
Reference: Fast Algorithms for Sorting and Searching by Bentley and Sedgewick.
Property A. (Bentley-Sedgewick) Given an input set, the number of nodes in its TST is the same, regardless of the order in which the strings are inserted.
Pf. There is a unique node in the TST for each distinct string prefix in the set. The relative positions of the nodes within the TST can change depending on the insertion order, but the number of nodes is invariant.
Advanced operations.
Wildcard search, prefix match. The r-way trie and TST implementations include code for wildcard matching and prefix matching.Lazy delete = change the word boundary bit. Eager delete = clean up any dead parent links.
Application: T9 text input for cell phones. User types using phone pad keys; system displays all words that correspond (and auto-completes as soon as it is unique). If user types 0, system displays all possible auto-completions.
Q+A
Exercises
- Write nonrecursive versions of an R-way trie string set and a TST.
- Unique substrings of length L. Write a program that reads in text from standard input and calculate the number of unique substrings of length L that it contains. For example, if the input is cgcgggcgcg then there are 5 unique substrings of length 3: cgc, cgg, gcg, ggc, and ggg. Applications to data compression. Hint: use the string method substring(i, i + L) to extract ith substring and insert into a symbol table. Alternative solution: compute the hash of the i+1st substring using the hash of the ith substring. Test it out on the first million digits of π. or the first 10 million digits of π.
- Unique substrings. Write a program that reads in text from standard input and calculates the number of distinct substrings of any length. (Can do very efficiently with a suffix tree.)
- Document similarity. To determine the similarity of two documents, calculate the number of occurrences of each trigram (3 consecutive letters). Two documents are similar if the Euclidean distance between the frequency vector of trigrams is small.
- Spell checking. Write a program SpellChecker.java that the name of a file containing a dictionary of words in the English language, and then reads string from standard input and prints out any word that is not in the dictionary. Use a string set.
- Spam blocklist. Insert known spam email addresses into an existence table and use to blocklist spamm.
- IP lookup by country. Use the data file ip-to-country.csv to determine what country a given IP address is coming from. The data file has five fields (begining of IP address range, ending of IP address range, two character country code, three character country code, and country name. See The IP-to-country website. The IP addresses are non-overlapping. Such a database tool can be used for: credit card fraud detection, spam filtering, auto-selection of language on a web site, and web server log analysis.
- Inverted index of web. Given a list of web pages, create a symbol table of words contained in the web pages. Associate with each word a list of web pages in which that word appears. Write a program that reads in a list of web pages, creates the symbol table, and support single word queries by returning the list of web pages in which that query word appears.
- Inverted index of web. Extend the previous exercise so that it supports multi-word queries. In this case, output the list of web pages that contain at least one occurrence of each of the query words.
- Symbol table with duplicates.
- Password checker. Write a program that reads in a string from the command line and a dictionary of words from standard input, and checks whether it is a "good" password. Here, assume "good" means that it (i) is at least 8 characters long, (ii) is not a word in the dictionary, (iii) is not a word in the dictionary followed by a digit 0-9 (e.g., hello5), (iv) is not two words separated by a digit (e.g., hello2world)
- Reverse password checker. Modify the previous problem so that (ii) - (v) are also satisfied for reverses of words in the dictionary (e.g., olleh and olleh2world). Clever solution: insert each word and its reverse into the symbol table.
- Random phone numbers. Write a program that takes a comand line input N and prints N random phone numbers of the form (xxx) xxx-xxxx. Use a symbol table to avoid choosing the same number more than once. Use this list of area codes to avoid printing out bogus area codes. Use an R-way trie.
- Contains prefix. Add a method containsPrefix() to StringSET takes a string s as input and return true if there is a string in the set that contains s as a prefix.
- Substring matches. Given a list of (short) strings, your goal is to support queries where the user looks up a string s and your job is to report back all strings in the list that contain s. Hint: if you only want prefix matches (where the strings have to start with s), use a TST as described in the text. To support substring matches, insert the suffixes of each word (e.g., string, tring, ring, ing, ng, g) into the TST.
- Zipf's law. Harvard linguist George Zipf observed that the frequency of the ith most common word in an English text containing N words is roughly proporitional to 1/i, where the constant of proportionality is 1 + 1/2 + 1/3 + ... + 1/N. Test "Zipf's law by reading in a sequence of words from standard input, tabulate their frequencies, and compare against the predicted frequencies.
- Typing monkeys and power laws. (Micahel Mitzenmacher) Suppose that a typing monkey creates random words by appending each of 26 possible lettter with probability p to the current word, and finishes the word with probability 1 - 26p. Write a program to estimate the frequency distribution of the lengths of words produced. If "abc" is produced more than once, only count it once.
- Typing monkeys and power laws.
Repeat the previous exercise, but assume that the letters a-z occur
proportional to the following probabilities, which are typical of
English text.
CHAR FREQ CHAR FREQ CHAR FREQ CHAR FREQ CHAR FREQ A 8.04 G 1.96 L 4.14 Q 0.11 V 0.99 B 1.54 H 5.49 M 2.53 R 6.12 W 1.92 C 3.06 I 7.26 N 7.09 S 6.54 X 0.19 D 3.99 J 0.16 O 7.60 T 9.25 Y 1.73 E 12.51 K 0.67 P 2.00 U 2.71 Z 0.09 F 2.30
- Indexing a book.
Write a program that reads in a text file from standard input and compiles
an alphabetical index of which words appear on which lines, as in
the following input. Ignore case and puncuation.
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, age 3-4 best 1 foolishness 4 it 1-4 of 1-4 the 1-4 times 1-2 was 1-4 wisdom 4 worst 2
- Entropy. We define the relative entropy of a text corpus with N words, k of which are distinct as E = 1 / (N log N) * sum (pi log(k) - log(pi), i = 1..k) where p_i is the fraction of times that word i appears. Write a program that reads in a text corpus and prints out the relative entropy. Convert all letters to lowercase and treat punctuation marks as whitespace.
- Longest prefix.
True or false. The longest prefix of a binary string x that is a key in
a symbol table is either the floor of x
or the ceiling of x (or both if x is in the set).
False. The longest prefix of 1100 in { 1, 10, 1011, 1111 } is 1, not 1011 or 1111.
Creative Exercises