Micah Adler and I wrote an early paper on how to compress the Web graph. This paper introduced the idea of using references between pages for compression in this context, which is now used in many related works. I have also considered how to sample pages uniformly from the Web graph using random walks.
The Internet is full of power laws. A historical survey that I wrote about power laws helps explain why, and points out that a great deal was known about power laws before computer scientists became interested in them. My own work on power laws includes a paper providing a model that explains the power law behavior of file sizes. I have also written a paper analyzing power law behavior in the model where monkeys type randomly, correcting a confusion in the literature about the resulting distribution. Anyone who enjoys monkeys should really read this paper. My most recent endeavor in this area has been to consider how to combine rankings from independent sources into a global ranking, which has applications to search and meta-search. I was introduced to this area by my esteemed collaborator Cynthia Dwork (Microsoft Research, San Jose). We should have a draft of our latest results up shortly.
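To give a feel for why random typing yields a power law, here is a back-of-the-envelope sketch of the model, not the analysis from the paper: each keystroke is a uniform letter or a space, every word of the same length is equally likely, and ranking words by probability makes log-frequency essentially linear in log-rank. The alphabet size and space probability below are illustrative choices.

```python
import math

K = 26         # alphabet size (illustrative)
P_SPACE = 0.2  # probability a keystroke is the word-ending space (illustrative)

def word_prob(length):
    # probability the monkey types one specific word of this length,
    # followed by a space
    return ((1 - P_SPACE) / K) ** length * P_SPACE

def first_rank(length):
    # rank words by probability, i.e. by length; all K^j words of each
    # shorter length j come first
    return sum(K ** j for j in range(1, length)) + 1

# On log-log axes the rank-frequency curve is essentially linear: the
# slope between successive length blocks approaches a constant, the
# power-law exponent (close to -1, as in Zipf's law).
slopes = [
    (math.log(word_prob(L)) - math.log(word_prob(L - 1)))
    / (math.log(first_rank(L)) - math.log(first_rank(L - 1)))
    for L in (2, 3, 4)
]
```

With these parameters the slopes all come out near -1.07, and they converge quickly, which is the power-law behavior in question.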
Our work also led to analyses of low-density parity-check codes over the binary symmetric error channel. We developed the martingale-based argument showing that the performance of a randomly chosen LDPC code is concentrated around its expectation. Richardson and Urbanke later developed density evolution, leading to low-density parity-check codes that provably come very close to the Shannon capacity for basic channels. For this work, the six of us shared the 2002 IEEE Information Theory Society Paper Award.
My latest coding theory work includes extending these ideas to more complex channels with memory. I also introduced a new low-density parity-check code variation called Verification Codes, which combines the benefits of the error and erasure codes developed previously. Finally, I have become interested in deletion channels, which do not seem to be well understood. There are variations of Verification Codes for deletion channels.
My primary contribution was to apply the technique of using limiting differential equations to study this and related load balancing problems. This technique proves useful for studying queueing variants: jobs arrive as a Poisson process, and each must be served at one of n queues. Having each job choose the shortest queue is best, but requires centralization. Having each job choose the shorter of two randomly chosen queues performs extremely well, much better than just choosing a queue at random. I applied differential equations to study this problem as well as many other variations.
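The effect is easy to see even in the static balls-and-bins analog of the queueing setting. The sketch below (parameters are illustrative, and this is a simulation rather than the differential-equation analysis) throws many balls into bins and compares one random choice against the best of two:

```python
import random

def simulate(n=100, jobs=100_000, d=1, seed=1):
    """Throw `jobs` balls into `n` bins; each ball lands in the least
    loaded of `d` bins chosen uniformly at random.  Returns the max load."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(jobs):
        best = min((rng.randrange(n) for _ in range(d)),
                   key=lambda i: load[i])
        load[best] += 1
    return max(load)
```

In runs of this sketch, a single random choice leaves the fullest bin tens of balls above the average of 1,000, while the best of two choices leaves it only a few above — the power of two choices.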
One very interesting variation is the case where the load information obtained at the queues is stale information. Using the analysis techniques above, I showed that "herd behavior" can occur: incoming jobs can herd at a small number of apparently less-loaded queues before the load information is properly updated, causing oscillatory and inefficient loads. This behavior has in fact been seen in many real systems. Using the power of two choices can greatly mitigate this behavior.
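The herding phenomenon can be reproduced in a toy simulation. The sketch below is my illustrative setup, not the model from the analysis: arrivals consult a load snapshot that is only refreshed periodically, so between refreshes they all pile onto the queue that merely *looks* shortest.

```python
import random

def herd_sim(n=10, steps=2000, update_every=1, seed=3):
    """One job arrives per step and joins the queue that looks shortest
    according to a load snapshot refreshed every `update_every` steps
    (ties broken by index); each nonempty queue then finishes its job
    with probability 0.2.  Returns the worst queue length observed."""
    rng = random.Random(seed)
    q = [0] * n
    snapshot = list(q)
    worst = 0
    for t in range(steps):
        if t % update_every == 0:
            snapshot = list(q)  # load information goes stale between refreshes
        i = min(range(n), key=lambda j: snapshot[j])
        q[i] += 1
        q = [x - 1 if x > 0 and rng.random() < 0.2 else x for x in q]
        worst = max(worst, max(q))
    return worst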
Also on the practical side, Andrei Broder and I wrote a paper showing how to use the power of two choices to improve hash table lookups in IP routers. John Byers, Jeff Considine, and I showed how the power of two choices can be used in peer-to-peer networks such as Chord, and in other situations governed by an underlying geometry.
Encoded data allows one to do all kinds of interesting things, including download from multiple sites in parallel transparently. John Byers and I further extended this idea in our SIGCOMM 2002 paper, which explains how to use codes and Bloom filters to achieve informed content delivery across adaptive overlay networks.
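As a concrete aid, here is a minimal Bloom filter sketch (sizes and hash counts are illustrative choices of mine): a compact set summary with one-sided error, the kind of structure a peer can exchange to let others estimate which encoded symbols it already holds.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hashed positions per item; membership
    queries may give false positives but never false negatives."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # derive k positions by hashing the item with k different salts
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

# a peer summarizes the (hypothetical) blocks it holds
summary = BloomFilter()
for name in ("block-0", "block-1", "block-2"):
    summary.add(name)
```

A peer receiving `summary` can test its own block names against it and send only blocks that appear to be missing, at the cost of occasionally skipping a block due to a false positive.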
Interactive, or human-in-the-loop, optimization systems leverage people's abilities in areas in which they outperform computers, such as visual and strategic thinking. Users can steer interactive optimization systems towards solutions which satisfy real-world constraints. Furthermore, people can better understand, justify, and modify solutions if they have participated in their construction.
The Human-Guided Search (HuGS) framework and Java toolkit we have developed allow rapid development of interactive optimization systems. The framework and code include visual metaphors for focusing and constraining optimization algorithms. The user can select from different algorithms, including a human-guidable version of a powerful heuristic called tabu search. We have developed a wide variety of applications with the HuGS toolkit, including interactive systems to solve scheduling, vehicle routing, layout, and protein-folding problems. The HuGS system is described in an AAAI paper. Other specific papers resulting from this work cover protein folding and 2D strip packing.
For example, one problem that came up in class is the following: if I give you a collection of (ASCII text) files, it is easy to derive the optimal Huffman code for those files. Suppose instead that I allow you to make two distinct Huffman codes, and you can use the best of the pair on each file. How do you minimize the total compressed size of the collection? This problem comes up often in practice, but its complexity had apparently not been considered. I proved this problem was NP-hard.
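To make the problem concrete, here is a small sketch (my illustration, not a result from the paper): the cost of a group of files under one shared Huffman code is just the Huffman cost of their combined symbol frequencies, so the two-code problem asks for the best partition of the files into two groups. The brute force below only works for a handful of files, which is exactly the point of the NP-hardness result.

```python
import heapq
from collections import Counter

def huffman_bits(freqs):
    """Total bits to encode symbols with the given frequencies under an
    optimal Huffman code (standard merge-two-smallest construction)."""
    heap = list(freqs)
    heapq.heapify(heap)
    if len(heap) == 1:
        return heap[0]  # degenerate case: one symbol, one bit each
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def best_two_code_cost(files):
    """Brute force: try every split of the files into two groups, build
    one Huffman code per group, return the minimum total size in bits."""
    n = len(files)
    best = None
    for mask in range(1, 2 ** (n - 1)):  # fix the last file's group to skip mirrors
        groups = ([], [])
        for i, f in enumerate(files):
            groups[(mask >> i) & 1].append(f)
        cost = sum(huffman_bits(Counter("".join(g)).values())
                   for g in groups if g)
        best = cost if best is None else min(best, cost)
    return best
```

For two (hypothetical) files `"aaaabbb"` and `"ccccddd"`, a single shared code needs 2 bits per symbol (28 bits total), while giving each file its own code brings the total down to 14 bits.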
Inspired by a paper co-written by my advisor, Alistair Sinclair, I worked on some problems on average case analysis of randomized bin packing. Susanne Albers and I showed that First Fit is unstable on an interesting distribution; Claire Kenyon and I showed that Best Fit is unstable on an interesting class of skewed distributions, making progress on a long-standing open question.
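For reference, First Fit itself fits in a few lines; the instability results concern how its wasted space grows over time under particular item-size distributions, which this sketch does not reproduce — the distribution below is just an illustrative continuous-style example.

```python
import random

def first_fit(items, capacity=100):
    """Place each item in the first open bin with room; open a new bin
    if none fits.  Returns the list of bin loads."""
    bins = []
    for x in items:
        for i, load in enumerate(bins):
            if load + x <= capacity:
                bins[i] = load + x
                break
        else:
            bins.append(x)
    return bins

rng = random.Random(0)
items = [rng.randint(1, 100) for _ in range(10_000)]
loads = first_fit(items)
waste = sum(100 - l for l in loads)  # total unused space across open bins
```

Tracking `waste` as items keep arriving (or arrive and depart) is the quantity whose growth distinguishes stable from unstable distributions.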
I think stochastic dominance is under-utilized in computer science. I have written papers that use methods from the theory of stochastic dominance to find bounds for routing schemes, to optimize information aggregation schemes, and to plan in transportation networks.
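For readers unfamiliar with the notion, here is a minimal sketch of the first-order dominance check for two finite distributions over the same ordered support (an illustration of the definition, not of any particular paper): one distribution dominates another exactly when its CDF lies pointwise below.

```python
def dominates(p, q):
    """First-order stochastic dominance: p and q are probability lists
    over the same increasing support; p dominates q iff p's CDF is
    pointwise <= q's CDF (p puts more mass on larger values)."""
    cp = cq = 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        if cp > cq + 1e-12:  # small tolerance for float rounding
            return False
    return True
```

For example, over support (1, 2, 3), the distribution (0.2, 0.3, 0.5) dominates (0.5, 0.3, 0.2). The usefulness is that a dominance relation transfers bounds: any increasing performance measure of the dominated system is bounded by that of the dominating one, so a hard-to-analyze system can be compared against a simpler one.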