3.5 Searching Applications

This section under major construction.

From the early days of computing, when symbol tables allowed programmers to progress from using numeric addresses in machine language to using symbolic names in assembly language, to modern applications of the new millennium, when symbolic names have meaning across worldwide computer networks, fast search algorithms have played and will play an essential role in computation. Modern applications for symbol tables include organization of scientific data, from searching for markers or patterns in genomic data to mapping the universe; organization of knowledge on the web, from searching in online commerce to putting libraries online; and implementing the internet infrastructure, from routing packets among machines on the web to shared file systems and video streaming.

Set APIs.

Some symbol-table clients do not need the values, just the ability to insert keys into a table and to test whether a key is in the table. Because we disallow duplicate keys, these operations correspond to the following API where we are just interested in the set of keys in the table.

To illustrate uses of SET.java, we consider filter clients that read a sequence of keys from the standard input and write some of them to standard output.

Dedup. Programs DeDup.java removes duplicates in the input stream.
Allowlist and blocklist filtering. Another classic that uses keys in a separate file to decide which keys from the input stream are passed to the output stream. Program AllowFilter.java implements an allowlist where any key that is in the file is passed through to the output and any key that is not in the file is ignored. Program BlockFilter.java implements a blocklist where any key that is in the file is ignored and any key that is is not in the file is passed through to the output.

Dictionary clients.

The most basic kind of symbol-table client builds a symbol table with successive put operations in order to support get requests. The following list of familiar examples illustrates the utility of this approach.

As a specific example, we consider a symbol-table client that you can use to look up information that is kept in file using the comma-separated-value (.csv) file format. LookupCSV.java builds a set of key-value pairs from a file of comma-separated values as specified on the command line and then prints out values corresponding to keys read from standard input. The command-line arguments are the file name and two integers, one specifying the field to serve as the key and the other specifying the field to serve as the value.

Indexing clients.

This application is another prototypical example of a symbol table client. We have a large amount of data and want to know where certain strings of interest occur. This seems to be associating multiple values with each key,but we are actually associating just one: a SET.

FileIndex.java takes a sequence of file names as command-line arguments and builds a symbol table associating every keyword with a SET of file names where the keyword can be found. Then, it takes keyword queries from standard input.

MovieIndex.java reads a data file of performers and movies.

Sparse vectors and matrices.

Program SparseVector.java implements a sparse vector using a symbol table of index-value pairs. Memory is proportional to the number of nonzeros. The set and get operations take log n time in the worst case; taking the dot product of two vectors takes time proportional to the number of nonzero entries in both.

System symbol table.

Java has several library functions for sets and symbol tables. Similar APIs, but you can insert null values into a symbol table.

TreeMap library uses a red-black tree. Guarantees log N performance per insert/search/delete. Their implementation maintains three pointers (two children and parent) for each node, whereas our implementation only stores two.
Sun's implementation of HashMap in Java 1.5 uses a hash table with separate chaining. The table size is a power of 2 (instead of a prime). This replaces a relatively expensive % M operation with an AND. Default load factor = 0.75. To guard against some poorly written hash functions, they apply the following scrambling routine to hashCode
static int hash(Object x) { int h = x.hashCode(); h += ~(h << 9); h ^= (h >>> 14); h += (h << 4); h ^= (h >>> 10); return h; }

Q + A

Q. When running performance benchmarks, what is a reasonable mix of insert, search, and delete operations?

A. It depends upon the application. The Java Collections framework are optimized for a mix of around 85% search/traverse, 14% insert/update, and 1% deletion.

Exercises

Creative Exercises

Concordance. Write an ST client Concordance.java that puts on standard output a concordance of the strings in the standard input stream.
Sparse matrices. Develop an API and an implementation for sparse 2D matrices. Support matrix addition and matrix multiplication. Include constructors for row and column vectors.
Solution: SparseVector.java and SparseMatrix.java.

Web Exercises

Modify FrequencyCount to read in a text file (comprised of UNICODE characters) and print out the alphabet size (number of distinct characters) and a table of characters and their frequencies, sorted in descending order of frequency.
Set intersection and union. Given two sets of strings, write a code fragment that computes a third set that contains those strings that appear in both sets (or either set). Hint: iterate over the elements in one set and check if they're in the other set.
Bidirectional symbol tables. Support put(key, value) and getByKey(key) or getByValue(value). Use two symbol tables behind the scenes. Ex: DNS and reverse DNS.
Highlighting browser hyperlinks. With each visited website, maintain the last time the site was visited so that you only highlight those sites that have been visited within the past month.
Frequency symbol table. Write an abstract data type FrequencyTable.java that supports the following operations: hit(Key), and count(Key). The hit operation increments the number of times the string appears by one. The count operation returns the number of times the given string appears, possibly 0. Applications: web counter, web log analyzer, music jukebox that counts number of times each song has been played....
Non-overlapping interval search. Given a list of non-overlapping intervals of integers (or dates), write a function that takes an integer argument and determines in which if any interval that values lies, e.g., if the intervals are 1643-2033, 5532-7643, 8999-10332, 5666653-5669321, then the query point 9122 lies in the third interval and 8122 lies in no interval.
Registrar scheduling. The Registrar at a prominent northeastern University recently scheduled an instructor to teach two different classes at the same exact time. Help the Registrar prevent future mistakes by describing a method to check for such conflicts. For simplicity, assume all classes run for 50 minutes starting at 9, 10, 11, 1, 2, or 3.
List. Implement the following list operations: size(), addFront(item), addBack(item), delFront(item), delBack(item), contains(item), delete(item), add(i, item), delete(i), iterator(). All operations should be efficient (logarithmic). Hint: use two symbol tables, one to find the ith element in the list efficiently, and the other to efficiently search by item. Java's List interface contains these methods, but does not supply any implementation that supports all ops efficiently.
Indirect PQ. Write a program IndirectPQ.java that implements an indirect PQ.
LRU cache. Create a data structure that supports the following operations: access and remove. The access operation inserts the item onto the data structure if it's not already present. The remove operation deletes and returns the item that was least recently accessed. Hint: maintain the items in order of access in a doubly linked list, along with pointers to the first and last nodes. Use a symbol table with keys = items, values = location in linked list. When you access an element, delete it from the linked list and re-insert it at the beginning. When you remove an element, delete it from the end and remove it from the symbol table.
UniQueue. Create a data type that is a queue, except that an element may only be inserted the queue once. Use an existence symbol table to keep track of all elements that have ever been inserted and ignore requests to re-insert such items.
Symbol table with random access. Create a data type that supports inserting a key-value pair, searching for a key and returning the associated value, and deleting and returning a random value. Hint: combine a symbol table and a randomized queue.
Correcting misspellings. Write a program that reads in text from standard input and replaces any commonly misspelled words with the suggested replacement, and prints the result to standard output. Use this list of common misspellings adapted from Wikipedia.
Move-to-front. Encoding: need rank query, delete, and insert. Decoding: need find ith, delete, and insert.
Mutable string. Create a data type that supports the following operations on a string: get(int i), insert(int i, char c), and delete(int i), where get returns the ith character of the string, insert inserts the character c and makes it the ith character, and delete deletes the ith character. Use a binary search tree.
Hint: Use a BST (with key = real number between 0 and 1, value = character) so that the inorder traversal of the tree yields the characters in the appropriate order. Use select() to find the ith element. When inserting a character at position i, choose the real number to be the average of the keys currently at positions i-1 and i.
Power method and largest eigenvalue. To compute the eigenvalue of largest magnitude (and corresponding eigenvector), use the power method. Under technical conditions (gap between largest two eigenvalues), it quickly converges to the right answer.
- Make initial guess x₁
- y_n = x_n / ||x_n||
- x_n+1 = A y_n
- λ = x_n+1^T y_n
- n = n + 1
If A is sparse, then this algorithm exploits sparsity. Example: Google PageRank.
Outerproduct. Add a method outer to Vector so that a.outer(b) returns the outer product of two length-N vectors a and b. The result is an N-by-N matrix.
Power law of web links. (Michael Mitzenmacher) The indegrees and outdegrees of World Wide Web obey a power law. Can be modeled by preferred attachment process. Suppose that each web page has exactly one outgoing link. Each page is created one at a time, starting with a single page that points to itself. With probability p < 1, it links to one of the existing pages, chosen uniformly random. With probability 1-p it links to an existing page with probability proportional to the number of incoming links of that page. This rule reflects the common tendency for new web pages to point to popular pages. Write a program to simulate this process and plot a histogram of the number of incoming links.
VMAs. BST used in Unix kernels for managing a set of virtual memory areas (VMAs). Each VMA represents a section of memory in a Unix process. VMAs vary in size from 4KB to 1GB. Also want to support range queries to determine which VMAs overlap a given range. Reference
Internet peer cache. Each IP packet sent by an Internet host is stamped with a 16-bit id that must be unique for that source-destination pair. Linux kernel uses an AVL tree indexed by IP address. Hashing would be faster, but want to avoid attacker sending IP packets with worst-case inputs. Reference
File indexing variants.
- Remove stopwords, e.g., a, the, on, of. Implement using another set.
- Support multi-word queries. This requires a set intersection operation. If you always intersect with the smallest set first, this takes time proportional to the size of the smallest set.
- Implement OR or other boolean logic.
- Record the position of word or number of occurrences of word in a document.
Arithmetic expression interpreter. Write a program Interpreter.java to parse and evaluate expressions of the following form.
>> x := 34 x := 34.0 >> y := 23 * x y := 782.0 >> z := x ^ y z := Infinity >> z := y ^ 2 z := 611524.0 >> x x := 34.0 >> x := sqrt 2 x := 1.4142135623730951
Variants.
- Add more more complicated expressions, e.g., z = 7 * (x + y * y), using conventional operator precedence.
- Add more error-checking and recovery.