Michael Mitzenmacher: Research Summary

Overview

My main area of interest is applied algorithms. More specifically, I look for problems in the intersection (and union) of algorithms, networks, and information theory. Fortunately, the explosion of the Internet has led to many interesting and important problems in these areas.

Algorithms for the Internet

Most of my research relates to designing algorithms for the Internet. Some of this work is described in the sections on Coding Theory, Applications of Digital Fountains, and The Power of Two Choices. Here I discuss other topics. I am a big fan of Bloom filters -- they will have a prominent place in my textbook on randomized algorithms! A Bloom filter is a data structure designed to succinctly represent a set of objects so that you can answer queries of the form "Is this an element of the set?" very quickly. The price you pay for the gain in speed and space is that sometimes the data structure will tell you an element is in the set when it is not. That is, there are false positives. Whenever you want to use or send a set, but the space requirements are too high, you should think about using a Bloom filter instead. This data structure has recently been suggested for several network applications, as Andrei Broder and I detail in our survey of Bloom filters in network applications. I have shown that Bloom filters can be usefully compressed, correcting a misunderstanding that had plagued Bloom filters since their beginnings. Finally, I had the opportunity to combine codes and Bloom filters while working with John Byers and the people on the Shapeshifter project at Boston University. Our SIGCOMM 2002 paper discusses how to use codes and Bloom filters to achieve informed content delivery across adaptive overlay networks. A technical report gives details about a new data structure we developed that uses a Bloom filter for fast approximate reconciliation, a useful primitive in overlay networks.
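
To make the tradeoff concrete, here is a minimal Bloom filter sketch in Python. The parameters and the double-hashing construction from a single SHA-256 digest are illustrative choices of mine, not taken from any of the papers above.

```python
# Minimal Bloom filter sketch: m bits, k hash functions derived by double
# hashing from one SHA-256 digest. Illustrative only.
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                    # number of bits
        self.k = k                    # number of hash functions
        self.bits = [False] * m

    def _indexes(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # No false negatives; a false positive occurs if other items
        # happen to have set all k of this item's bits.
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(m=1 << 16, k=5)
bf.add("example.com")
print("example.com" in bf)   # True
print("absent.org" in bf)    # False, except with small probability
```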

Micah Adler and I wrote an early paper on how to compress the Web graph. This paper introduces the idea of using references between pages to compress the graph in this context, an approach now used in many related works. I have also considered how to sample pages uniformly from the Web graph, using random walks on the Web.
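
As a toy illustration of the reference idea -- not the algorithm from our paper, which selects references far more carefully -- the sketch below stores each page's adjacency list either explicitly or as a pointer to the previous page's list plus the links to add and remove.

```python
# Toy reference-based compression of adjacency lists: a page is stored
# either as a plain link list or as (reference page, links added, links
# removed). The cost model (count of stored items) is a simplification.
def compress(order, adjacency):
    compressed, prev = {}, None
    for page in order:
        links = set(adjacency[page])
        if prev is None:
            ref_cost, added, removed = float("inf"), set(), set()
        else:
            ref_links = set(adjacency[prev])
            added, removed = links - ref_links, ref_links - links
            ref_cost = 1 + len(added) + len(removed)
        if ref_cost < len(links):
            compressed[page] = ("ref", prev, sorted(added), sorted(removed))
        else:
            compressed[page] = ("plain", sorted(links))
        prev = page
    return compressed

adjacency = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 3, 4, 6],   # nearly the same links as A
    "C": [7, 8],
}
print(compress(["A", "B", "C"], adjacency))
```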

The Internet is full of power laws. A historical survey that I wrote about power laws helps explain why, and points out that a great deal was known about power laws before computer scientists became interested in them. My own work on power laws includes a paper providing a model that explains the power law behavior of file sizes. I have also written a paper analyzing power law behavior in the model where monkeys type randomly, correcting a confusion in the literature about the resulting distribution. Anyone who enjoys monkeys should really read this paper. My most recent endeavor in this area has been to consider how to combine rankings from independent sources into a global ranking, which has applications to search and meta-search. I was introduced to this area by my esteemed collaborator Cynthia Dwork (Microsoft Research, San Jose). We should have a draft up shortly of our latest results.
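
For a feel of the random-typing model, here is a rough simulation; the alphabet size and space probability are arbitrary choices of mine, and the printout only hints at the rank-frequency behavior analyzed in the paper.

```python
# "Monkeys typing randomly": letters are typed uniformly at random, and a
# space ends the current word. Alphabet size and space probability are
# arbitrary. Words of the same length are equally likely, so the
# rank-frequency curve falls in steps that roughly follow a power law.
import random
from collections import Counter

random.seed(0)
ALPHABET = "abcdefghij"
P_SPACE = 0.2

def random_word():
    letters = []
    while True:
        if letters and random.random() < P_SPACE:
            return "".join(letters)
        letters.append(random.choice(ALPHABET))

counts = Counter(random_word() for _ in range(200_000))
freqs = sorted(counts.values(), reverse=True)
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print("rank", rank, "frequency", freqs[rank - 1])
```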

Coding Theory

I co-developed Tornado codes, a particularly efficient type of low-density parity-check (LDPC) code for erasures, with Michael Luby, Amin Shokrollahi, and Dan Spielman. These ideas have been further developed and commercialized by the company Digital Fountain. Digital fountains are discussed further below.
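
The erasure decoding behind these codes is a peeling process: repeatedly find a parity check with exactly one erased neighbor and recover that symbol by XOR. Here is a minimal sketch on a made-up toy check structure; real Tornado codes use carefully designed irregular random graphs.

```python
# Peeling decoder sketch for LDPC-style erasure codes. Each check is a
# list of symbol indices together with the XOR of their original values.
def peel_decode(symbols, checks):
    """symbols: list of ints, with None marking an erased symbol."""
    progress = True
    while progress:
        progress = False
        for idxs, parity in checks:
            erased = [i for i in idxs if symbols[i] is None]
            if len(erased) == 1:
                value = parity
                for i in idxs:
                    if symbols[i] is not None:
                        value ^= symbols[i]
                symbols[erased[0]] = value   # recovered from this check
                progress = True
    return symbols

# Three data symbols with symbols 1 and 2 erased, and two checks.
data = [5, None, None]
checks = [([0, 1], 5 ^ 7), ([1, 2], 7 ^ 2)]
print(peel_decode(data, checks))   # [5, 7, 2]
```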

Our work also led to analyses of low-density parity-check codes over the binary symmetric channel. We developed the martingale-based argument showing that the performance of a randomly chosen LDPC code is concentrated around its expectation. Richardson and Urbanke later developed density evolution, leading to low-density parity-check codes that provably come very, very close to the Shannon capacity for basic channels. For this work, the six of us shared the 2002 IEEE Information Theory Society Paper Award.
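
Density evolution itself is easiest to see on the binary erasure channel rather than the binary symmetric channel: for a regular code, the probability that an edge still carries an erasure obeys a one-line recursion. The (3,6)-regular example below is a standard textbook case, not something from our papers.

```python
# Density evolution sketch on the binary erasure channel for a regular
# LDPC code: x is the per-edge erasure probability after each iteration.
def residual_erasure(eps, dv=3, dc=6, iters=5000):
    x = eps
    for _ in range(iters):
        x = eps * (1 - (1 - x) ** (dc - 1)) ** (dv - 1)
    return x

# Below the (3,6) threshold (about 0.43) the erasures die out; above it,
# a constant fraction remains. The Shannon limit for rate 1/2 is 0.5.
for eps in (0.40, 0.42, 0.44, 0.46):
    print(eps, residual_erasure(eps))
```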

My latest coding theory work includes extending these ideas to more complex channels with memory. I also introduced a new low-density parity-check code variation called Verification Codes, which combines the benefits of the error and erasure codes developed previously. Finally, I have become interested in deletion channels, which do not seem to be well understood. There are variations of Verification Codes for deletion channels.

The Power of Two Choices/Balls and Bins

My Ph.D. thesis was on the power of two choices, a theme most clearly elucidated in earlier work by Azar, Broder, Karlin, and Upfal. The basic result is the following: if n balls are thrown independently and uniformly at random into n bins, the maximum number of balls in a bin, also called the maximum load, is (1 + o(1)) log n / log log n with high probability. If instead the balls are placed sequentially, and each ball is given d >= 2 choices selected independently and uniformly at random from the n bins, the maximum load is log log n / log d + O(1) with high probability. Just two choices completely changes the behavior of the maximum load, reducing it dramatically.
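
A quick simulation makes the gap vivid (the value of n below is an arbitrary choice):

```python
# Throw n balls into n bins, giving each ball d uniform random choices
# and placing it in the least loaded of them.
import random

def max_load(n, d, seed=0):
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

n = 100_000
print("d = 1:", max_load(n, 1))   # grows like log n / log log n
print("d = 2:", max_load(n, 2))   # grows like log log n / log 2
```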

My primary contribution was to apply the technique of using limiting differential equations to study this and related load balancing problems. This technique proves useful for studying queueing variants: jobs come in as a Poisson process, and must choose service at one of n queues. Having each job choose the shortest queue is best, but requires centralization. Having each job choose the shortest of two randomly chosen queues performs extremely well, much better than just choosing a queue randomly. I applied differential equations to study this problem as well as many other variations.
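
For the two-choice queueing model, the limiting system tracks s_i, the fraction of queues with at least i jobs. A crude Euler integration is sketched below; the arrival rate, truncation, and step size are arbitrary choices, and the result is compared against the closed-form fixed point s_i = lambda^((d^i - 1)/(d - 1)).

```python
# Mean-field (limiting differential equation) sketch for the d-choice
# queueing model: Poisson arrivals at rate lam per queue, service rate 1,
# s[i] = fraction of queues with at least i jobs, s[0] = 1.
d, lam = 2, 0.9
MAX_LEN, DT, STEPS = 50, 0.05, 20_000

s = [1.0] + [0.0] * MAX_LEN
for _ in range(STEPS):
    new = s[:]
    for i in range(1, MAX_LEN):
        ds = lam * (s[i - 1] ** d - s[i] ** d) - (s[i] - s[i + 1])
        new[i] = s[i] + DT * ds
    s = new

# The numerical values should agree closely with the known fixed point.
for i in range(1, 6):
    print(i, round(s[i], 4), round(lam ** ((d ** i - 1) / (d - 1)), 4))
```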

One very interesting variation is the case where the load information obtained at the queues is stale information. Using the analysis techniques above, I showed that "herd behavior" can occur: incoming jobs can herd at a small number of apparently less-loaded queues before the load information is properly updated, causing oscillatory and inefficient loads. This behavior has in fact been seen in many real systems. Using the power of two choices can greatly mitigate this behavior.

Also on the practical side, Andrei Broder and I wrote a paper showing how to use the power of two choices to improve hash table lookups in IP routers. John Byers, Jeff Considine, and I showed how the power of two choices can be used in peer-to-peer networks such as Chord, and in other situations governed by an underlying geometry.

Applications of Digital Fountains

When we designed Tornado codes, we had in mind a better way of doing multicast. We developed an abstraction of the properties of encoded data that we call a digital fountain. Data from a digital fountain is like water -- you don't care how much water you get, you just care when your cup is full. Our first application of this abstraction was to multicast, which led us to consider congestion control mechanisms to make encoded multicast schemes TCP-friendly. All of this work, or improved variations, lives on in products from Digital Fountain, the company, and has appeared in IETF drafts.

Encoded data allows one to do all kinds of interesting things, including transparently downloading from multiple sites in parallel. John Byers and I further extended this idea in our SIGCOMM 2002 paper, which explains how to use codes and Bloom filters to achieve informed content delivery across adaptive overlay networks.

Human-Guided Search

I serve as an algorithmic consultant for the Human-Guided Search project at Mitsubishi Electric Research Labs. The project is led by the truly outstanding Neal Lesh, and more information can be found on the Human-Guided Search project page.

Interactive, or human-in-the-loop, optimization systems leverage people's abilities in areas in which they outperform computers, such as visual and strategic thinking. Users can steer interactive optimization systems towards solutions which satisfy real-world constraints. Furthermore, people can better understand, justify, and modify solutions if they have participated in their construction.

The Human-Guided Search (HuGS) framework and Java toolkit we have developed allow rapid development of interactive optimization systems. The framework and code include visual metaphors for focusing and constraining optimization algorithms. The user can select from different algorithms, including a human-guidable version of a powerful heuristic called tabu search. We have developed a wide variety of applications with the HuGS toolkit, including interactive systems to solve scheduling, vehicle routing, layout, and protein-folding problems. The HuGS system is described in an AAAI paper. Other specific papers resulting from this work cover protein folding and 2D strip packing.
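
For readers unfamiliar with tabu search, here is a generic sketch on a toy bit-flipping problem. It only conveys the basic heuristic (recent moves are temporarily forbidden, with an aspiration exception for new best solutions) and has nothing to do with the actual HuGS implementation.

```python
# Generic tabu search sketch: minimize the number of 1 bits in a bit
# string, flipping one bit per move. Toy problem and parameters only.
import random

def tabu_search(start, cost, iters=200, tenure=7):
    current = best = start
    expires = {}                       # bit position -> iteration its tabu ends
    for t in range(iters):
        candidates = []
        for i in range(len(current)):
            neighbor = current[:i] + (1 - current[i],) + current[i + 1:]
            # Allowed if not tabu, or if it beats the best seen so far.
            if expires.get(i, 0) <= t or cost(neighbor) < cost(best):
                candidates.append((cost(neighbor), i, neighbor))
        if not candidates:
            break
        _, move, current = min(candidates)
        expires[move] = t + tenure     # forbid flipping this bit for a while
        if cost(current) < cost(best):
            best = current
    return best

random.seed(0)
start = tuple(random.randint(0, 1) for _ in range(30))
best = tabu_search(start, cost=sum)
print(sum(start), "ones initially,", sum(best), "after tabu search")
```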

Other Work

One of the joys of being a theoretician is that occasionally you have the luxury of spending time tackling problems not because they are part of a larger agenda, but simply because they seem fun and you might learn something.

For example, one problem that came up in class is the following: if I give you a collection of (ASCII text) files, it is easy to derive the optimal Huffman code for those files. Suppose instead that I allow you to make two distinct Huffman codes, and you can use the better of the pair on each file. How do you minimize the total compressed size of the collection? This problem comes up often in practice, but its complexity had apparently not been considered. I proved this problem is NP-hard.
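
To make the objective concrete, the sketch below builds Huffman codes with Python's heapq and charges each file the cheaper of the two codes. The grouping of files used to build the two codes is an arbitrary illustration; the hardness result says that finding the best such grouping is the difficult part.

```python
# Two-code Huffman objective: build a code from symbol frequencies,
# measure each file under both codes, and pay the cheaper of the two.
import heapq
from collections import Counter

def code_lengths(freqs):
    """Return {symbol: codeword length} for an optimal Huffman code."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def compressed_bits(text, lengths):
    counts = Counter(text)
    if any(s not in lengths for s in counts):
        return float("inf")            # this code cannot encode the file
    return sum(counts[s] * lengths[s] for s in counts)

files = ["aaabbbccd", "xxxyyzzzz", "aabbxxyyzz"]
code1 = code_lengths(Counter(files[0] + files[2]))   # arbitrary grouping
code2 = code_lengths(Counter(files[1]))
total = sum(min(compressed_bits(f, code1), compressed_bits(f, code2)) for f in files)
print("total bits with this pair of codes:", total)
```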

Inspired by a paper co-written by my advisor, Alistair Sinclair, I worked on some problems on average case analysis of randomized bin packing. Susanne Albers and I showed that First Fit is unstable on an interesting distribution; Claire Kenyon and I showed that Best Fit is unstable on an interesting class of skewed distributions, making progress on a long-standing open question.
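
For reference, First Fit itself is simple to state and simulate: each arriving item goes into the first open bin with room, and a new bin is opened when none fits. The discrete item distribution below is in the spirit of those studied in these papers, but the specific parameters are arbitrary.

```python
# First Fit bin packing: items of integer size, bins of capacity 10,
# item sizes drawn uniformly from {1, ..., 8}.
import random

def first_fit(items, capacity=10):
    bins = []                          # current fill level of each open bin
    for x in items:
        for b in range(len(bins)):
            if bins[b] + x <= capacity:
                bins[b] += x
                break
        else:
            bins.append(x)             # no open bin fits; open a new one
    return bins

random.seed(0)
items = [random.randint(1, 8) for _ in range(2000)]
bins = first_fit(items)
print("bins used:", len(bins), " wasted capacity:", len(bins) * 10 - sum(items))
```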

I think stochastic dominance is under-utilized in computer science. I have written papers on finding bounds for routing schemes, optimizing information aggregation schemes, and planning in transportation networks that use methods from the theory of stochastic dominance.
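
As a small example of the kind of tool involved, the sketch below checks first-order stochastic dominance between two discrete distributions: one random variable is stochastically smaller than another if its tail probability is never larger at any threshold. The example delay distributions are made up.

```python
# First-order stochastic dominance check for discrete distributions given
# as {value: probability} maps: X is stochastically smaller than Y if
# P(X > t) <= P(Y > t) for every t, i.e. X's CDF is never below Y's.
def stochastically_smaller(px, py):
    fx = fy = 0.0
    for t in sorted(set(px) | set(py)):
        fx += px.get(t, 0.0)           # P(X <= t)
        fy += py.get(t, 0.0)           # P(Y <= t)
        if fx < fy - 1e-12:            # then P(X > t) > P(Y > t)
            return False
    return True

delay_a = {1: 0.5, 2: 0.3, 4: 0.2}
delay_b = {1: 0.3, 3: 0.4, 4: 0.3}
print(stochastically_smaller(delay_a, delay_b))   # True: delay_a is stochastically smaller
```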