3.5 Searching Applications
This section under major construction.
From the early days of computing, when symbol tables allowed
programmers to progress from using numeric addresses in machine language to using
symbolic names in assembly language, to modern applications of the new millennium,
when symbolic names have meaning across worldwide computer networks, fast search
algorithms have played and will play an essential role in computation. Modern
applications for symbol tables include organization of scientific data, from searching for
markers or patterns in genomic data to mapping the universe; organization of knowledge on the
web, from searching in online commerce to putting libraries online;
and implementing the internet infrastructure, from routing packets
among machines on the web to shared file systems and video streaming.
Set APIs.
Some symboltable clients do not need the values, just the ability to insert keys into a table and to test whether a key is in the table. Because we disallow duplicate keys, these operations correspond to the following API where we are just interested in the set of keys in the table.To illustrate uses of SET.java, we consider filter clients that read a sequence of keys from the standard input and write some of them to standard output.
 Dedup. Programs DeDup.java removes duplicates in the input stream.
 Whitelist and blacklist. Another classic that uses keys in a separate file to decide which keys from the input stream are passed to the output stream. Program WhiteList.java implements a whitelist where any key that is in the file is passed through to the output and any key that is not in the file is ignored. Program BlackFilter.java implements a blacklist where any key that is in the file is ignored and any key that is is not in the file is passed through to the output.
Dictionary clients.
The most basic kind of symboltable client builds a symbol table with successive put operations in order to support get requests. The following list of familiar examples illustrates the utility of this approach.
As a specific example, we consider a symboltable client that you can use to look up information that is kept in file using the commaseparatedvalue (.csv) file format. LookupCSV.java builds a set of keyvalue pairs from a file of commaseparated values as specified on the command line and then prints out values corresponding to keys read from standard input. The commandline arguments are the file name and two integers, one specifying the field to serve as the key and the other specifying the field to serve as the value.
Indexing clients.
This application is another prototypical example of a symbol table client. We have a large amount of data and want to know where certain strings of interest occur. This seems to be associating multiple values with each key,but we are actually associating just one: a SET.
MovieIndex.java reads a data file of performers and movies.
Sparse vectors and matrices.
Program SparseVector.java implements a sparse vector using a symbol table of indexvalue pairs. Memory is proportional to the number of nonzeros. The operations set and get take O(1) time (on average); taking the dot product of two vectors takes time proportional to the number of nonzero entries in both.System symbol table.
Java has several library functions for sets and symbol tables. Similar APIs, but you can insert null values into a symbol table. TreeMap library uses a redblack tree. Guarantees log N performance per insert/search/delete. Their implementation maintains three pointers (two children and parent) for each node, whereas our implementation only stores two.

Sun's implementation of HashMap in Java 1.5
uses a hash table with separate chaining.
The table size
is a power of 2 (instead of a prime). This replaces
a relatively expensive % M operation with an AND.
Default load factor = 0.75.
To guard against
some poorly written hash functions, they apply the following
scrambling routine to hashCode
static int hash(Object x) { int h = x.hashCode(); h += ~(h << 9); h ^= (h >>> 14); h += (h << 4); h ^= (h >>> 10); return h; }
Q + A
Q. When running performance benchmarks, what is a reasonable mix of insert, search, and delete operations?
A. It depends upon the application. The Java Collections framework are optimized for a mix of around 85% search/traverse, 14% insert/update, and 1% deletion.
Exercises
Creative Exercises
 Concordance. Write an ST client Concordance.java that puts on standard output a concordance of the strings in the standard input stream.
 Sparse matrices.
Develop an API and an implementation for sparse 2D matrices. Support matrix
addition and matrix multiplication. Include constructors for
row and column vectors.
Solution: SparseVector.java and SparseMatrix.java.
Web Exercises
 Modify FrequencyCount to read in a text file (comprised of UNICODE characters) and print out the alphabet size (number of distinct characters) and a table of characters and their frequencies, sorted in descending order of frequency.
 Set intersection and union. Given two sets of strings, write a code fragment that computes a third set that contains those strings that appear in both sets (or either set). Hint: iterate over the elements in one set and check if they're in the other set.
 Bidirectional symbol tables. Support put(key, value) and getByKey(key) or getByValue(value). Use two symbol tables behind the scenes. Ex: DNS and reverse DNS.
 Highlighting browser hyperlinks. With each visited website, maintain the last time the site was visited so that you only highlight those sites that have been visited within the past month.
 Frequency symbol table. Write an abstract data type FrequencyTable.java that supports the following operations: hit(Key), and count(Key). The hit operation increments the number of times the string appears by one. The count operation returns the number of times the given string appears, possibly 0. Applications: web counter, web log analyzer, music jukebox that counts number of times each song has been played....
 Nonoverlapping interval search. Given a list of nonoverlapping intervals of integers (or dates), write a function that takes an integer argument and determines in which if any interval that values lies, e.g., if the intervals are 16432033, 55327643, 899910332, 56666535669321, then the query point 9122 lies in the third interval and 8122 lies in no interval.
 Registrar scheduling. The Registrar at a prominent northeastern University recently scheduled an instructor to teach two different classes at the same exact time. Help the Registrar prevent future mistakes by describing a method to check for such conflicts. For simplicity, assume all classes run for 50 minutes starting at 9, 10, 11, 1, 2, or 3.
 List. Implement the following list operations: size(), addFront(item), addBack(item), delFront(item), delBack(item), contains(item), delete(item), add(i, item), delete(i), iterator(). All operations should be efficient (logarithmic). Hint: use two symbol tables, one to find the ith element in the list efficiently, and the other to efficiently search by item. Java's List interface contains these methods, but does not supply any implementation that supports all ops efficiently.
 Indirect PQ. Write a program IndirectPQ.java that implements an indirect PQ.
 LRU cache. Create a data structure that supports the following operations: access and remove. The access operation inserts the item onto the data structure if it's not already present. The remove operation deletes and returns the item that was least recently accessed. Hint: maintain the items in order of access in a doubly linked list, along with pointers to the first and last nodes. Use a symbol table with keys = items, values = location in linked list. When you access an element, delete it from the linked list and reinsert it at the beginning. When you remove an element, delete it from the end and remove it from the symbol table.
 UniQueue. Create a data type that is a queue, except that an element may only be inserted the queue once. Use an existence symbol table to keep track of all elements that have ever been inserted and ignore requests to reinsert such items.
 Symbol table with random access. Create a data type that supports inserting a keyvalue pair, searching for a key and returning the associated value, and deleting and returning a random value. Hint: combine a symbol table and a randomized queue.
 Correcting misspellings. Write a program that reads in text from standard input and replaces any commonly misspelled words with the suggested replacement, and prints the result to standard output. Use this list of common misspellings adapted from Wikipedia.
 Movetofront. Encoding: need rank query, delete, and insert. Decoding: need find ith, delete, and insert.
 Mutable string.
Create a data type that supports the following operations on a string:
get(int i), insert(int i, char c),
and delete(int i), where get returns the ith
character of the string, insert inserts the character c
and makes it the ith character, and delete deletes the
ith character. Use a binary search tree.
Hint: Use a BST (with key = real number between 0 and 1, value = character) so that the inorder traversal of the tree yields the characters in the appropriate order. Use select() to find the ith element. When inserting a character at position i, choose the real number to be the average of the keys currently at positions i1 and i.
 Power method and largest eigenvalue.
To compute the eigenvalue of largest magnitude (and corresponding eigenvector),
use the power method. Under technical conditions (gap between largest
two eigenvalues), it quickly converges to the right answer.
 Make initial guess x_{1}
 y_{n} = x_{n} / x_{n}
 x_{n+1} = A y_{n}
 λ = x_{n+1}^{T} y_{n}
 n = n + 1
 Outerproduct. Add a method outer to Vector so that a.outer(b) returns the outer product of two lengthN vectors a and b. The result is an NbyN matrix.
 Power law of web links. (Michael Mitzenmacher) The indegrees and outdegrees of World Wide Web obey a power law. Can be modeled by preferred attachment process. Suppose that each web page has exactly one outgoing link. Each page is created one at a time, starting with a single page that points to itself. With probability p < 1, it links to one of the existing pages, chosen uniformly random. With probability 1p it links to an existing page with probability proportional to the number of incoming links of that page. This rule reflects the common tendency for new web pages to point to popular pages. Write a program to simulate this process and plot a histogram of the number of incoming links.
 VMAs. BST used in Unix kernels for managing a set of virtual memory areas (VMAs). Each VMA represents a section of memory in a Unix process. VMAs vary in size from 4KB to 1GB. Also want to support range queries to determine which VMAs overlap a given range. Reference
 Internet peer cache. Each IP packet sent by an Internet host is stamped with a 16bit id that must be unique for that sourcedestination pair. Linux kernel uses an AVL tree indexed by IP address. Hashing would be faster, but want to avoid attacker sending IP packets with worstcase inputs. Reference
 File indexing variants.
 Remove stopwords, e.g., a, the, on, of. Implement using another set.
 Support multiword queries. This requires a set intersection operation. If you always intersect with the smallest set first, this takes time proportional to the size of the smallest set.
 Implement OR or other boolean logic.
 Record the position of word or number of occurrences of word in a document.
 Arithmetic expression interpreter.
Write a program Interpreter.java
to parse and evaluate expressions of the following form.
>> x := 34 x := 34.0 >> y := 23 * x y := 782.0 >> z := x ^ y z := Infinity >> z := y ^ 2 z := 611524.0 >> x x := 34.0 >> x := sqrt 2 x := 1.4142135623730951
Variants.
 Add more more complicated expressions, e.g., z = 7 * (x + y * y), using conventional operator precedence.
 Add more errorchecking and recovery.