CS 264. Peer-to-Peer Systems: Project Ideas

Bandwidth measurement on PlanetLab

In the SBON paper we encountered the notion of a latency space as calculated by the Vivaldi algorithm where each node successively refines its coordinates in the latency space through periodic measurements to other random nodes in the overlay. Is it possible to do something similar for a bandwidth space? Applications that run over a large testbed such as PlanetLab often need accurate estimates of the bandwidth between nodes. Using N squared measurements (to measure bw between all pairs of nodes) is expensive. Can we predict bandwidth based on a few measurements to a few other random nodes? This project would involve surveying and understanding existing bandwidth measurement techniques, using these techniques to get a good data set of bandwidth measurements on PlanetLab, analyzing the data and possibly proposing a method (for example, a bandwidth coordinate space) that allows nodes participating in an overlay like PlanetLab to predict the bandwidth to another node in the overlay based on a few measurments done to other nodes in the coordinate space.

Dynamic Network Coordinates

In looking at latency space, how often do network coordinates change? Coordinates that change back and forth between two nodes might indicate node instability, an underlying BGP route flap, or even a bug in your measurement algorithm. This project involves studying the dynamism in network coordinates over PlanetLab. You would measure a few weeks worth of network coordinates on PlanetLab, track the movement of coordinates in the system, and implement ways to allow a user to visualize the data. For example, if you need to visualize how 3-D latency coordinate space evolves over time (that is, a 4-D data set), you could graph the 3-D latency coordinates as a movie and show the time component through movement of the 3-D coordinate space. If you have interest in graphics, data visualization, and/or statistics, this is the project for you. You can use the Vivaldi simulator to get started.

Real-time news filtering over an SBON

Use the SBON to implement and deploy on PlanetLab or on the SYRAH group blades a real-time news filtering application. The application would fetch news feeds in realtime and process/filter the news feeds as directed by a user query (e.g., the user may be interested in news feeds that match a set of keywords). The project involves building operators to fetch, filter, and potentially store news feeds at overlay nodes, building a web interface to allow users to access the data returned, and potentially looking at how to re-use operators for multiple simultaneously running queries on the overlay, (i.e., study how to efficiently identify and leverage overlap between filters needed for different queries). For this project as well as the projects listed above, we have some code and/or in-house SBON experts to help you get started. :-)

Exploring Space-filling Curves

In a large distributed system, finding a node at or near a specific location can be hard -- especially if you want to do it efficiently. Several research projects (including the Hourglass project at Harvard) have begun approaching this problem in the following way: a structured overlay (or DHT) provides a one-dimensional space; let us imagine that locations (coordinates) were one-dimensional and also let us assume that each node knows its own location; nodes can "post" their locations into the DHT; a query for a node at or near a coordinate would perform an "expanding" query centered at this spot on the DHT; walking away from this spot in both directions along the ring, it would soon find the nearest node, and voila, we are done. Now, how can this be achieved if we drop the assumption that coordinates are one-dimensional? Answer: use a space-filling curve to map multi-dimensional coordinates to one-dimensional coordinates with the hope that points that are nearby each other in one space are nearby each other in the other. The Hourglass group currently uses a Hilbert curve to do this and has found it works reasonably well.

There are at least two venues for exploration in this project:


Evaluation: What makes a experimenting with a particular network testbed valid?

There exist a plethora of testbeds for performing experiments for p2p and other networking protocols. A standard of the networking community for years has been ns2, but more recently Narses, Emulab/Netbed, PlanetLab, and finally Modelnet have been introduced. What are the advantages and disadvantages of each and how faithfully do results from one carry over to results on another (and into real life!!)? Ideally, this project would consist of sets of micro- and macrobenchmarks that would expose the intended (and unintended) consequences of the testbeds' assumptions and operating environment. For example, narses juxtaposes itself as a scalable alternative to ns2, but what and how much does it lose in this attempt to scale. Seeing that it simply needs more nodes on which to run interesting experiments, Emulab/Netbed has recently introduced "virtual nodes," which allow a user to divide up the physical computers it provides as part of its experimentation environment. How does the traffic shaping they provide compare to that of ns2 and Narses? Finally, experimentation on PlanetLab is notorious unreproducable; does that make it inherently invalid or simply realistic?

Pick a topology, any topology!

Mercator, transit-stub, power-law and, the ever popular, random! As readers, we are presented results with only a hint of reason as to why a particular topology has been chosen for a set of experiments. Of course, when we question the first topology, there are always one or two others, similarly pulled out of thin air, on which author's system performs well. One need only look as far as the Pastry and CAN papers to see how little information we are given about the choice of a particular topology. Too frequently we are offered the argument "that's how it was done in the related work" when the related work hasn't explained the choice of topology either. Existing topology generators include GT-ITM (transit-stub), BRITE, Inet, and using network measurements from RON, PlanetLab, Caida. The goal of this project would be to synthesize what exists in the world of topology generators and present it in a format understandable, meaningful, and useful to the p2p community. The SYRAH group would ask you to present your work at one of our meetings at the end of the semester, and your work would have the potential to be cited in a wide variety of System, networking, and p2p contexts.

Survey of P2P and Ad Hoc Routing Protocols (done in Spring 2005)

Schollmeier has noted that there is a good deal of overlap between the routing protocols that have been proposed in the p2p and ad hoc routing communities, but, because these communities have their own journals and conferences, few are aware of the overlap and even fewer have benefitted from the potential for cross-pollinating research. The goal of this project would be to explore the similarities and differences in the various routing protocols that have been proposed and implemented. Are the different protocols more concerned with scale, fault-tolerance, trust, churn, memory usage and other state, etc.? What potential exists for cross-pollination? For this project, you could choose to do an in-depth survey with minimal implementation, a theoretical analysis of a spectrum of protocols, a simulation of a set of them, or an implementation of a few of your choice.

Neighbor/Finger Selection

There has been some work in the structured overlay/DHT world examining the impact of choosing one's neighbors out of a potential spectrum. The base Pastry algorithm incorporates the idea of choosing one's most proximate neighbor out of the range of potential entries, thereby reducing stretch. For example, if node p's ID is 1xxx and b=16, the entry for column 0 can be filled with any node whose ID begins with 0. On average, this is 1/16th of all nodes in the system. This means that there exist a broad spectrum of node to choose for this one spot. Of course, the same logic (and choice) applies to all of the other slots in the first row of p's routing table (except row 1, where it doesn't need an entry). The same kind of logic has been applied to the Chord finger table, allowing a Chord node to choose a neighbor from a wide spectrum, improving some characteristic like latency while keeping the same number of expected hops (see LPRS-Chord in Sigmetrics 2003).

Previous research has examined choosing the node with the lowest latency for a given spot and this yields good improvements in latency. What if other characteristics were added to this consideration, aside from just latency? Would a more stable system develop? For example, what if one choose the most available and high bandwidth nodes, those that could support the addtional in-degree? Issues like these have not been previously explored in the p2p context and could lead to some strong results.

New routing algorithms

We don't have a lot to say on this topic, but if you have a really new algorithm and you want to explore it in the context of this class, that would certainly work as a project. This could be done as a theoretical analysis, through simulation, or experimentation, or, even better, as a solid combination of these, depending on your bent. If you are thinking of doing a project like this, definitely look at the Gummadi paper " The Impact of DHT Routing on Resilience and Proximity" first. A similar type of comparison paper would certainly be interesting and valid if it explored different axes than the Gummadi paper.

Anonymization in Bit Torrent

As a New York Times article on BitTorrent pointed out, users of BitTorrent are currently not anonymous. That is, it is possible to see who is sending a file to whom. Can you change the BitTorrent algorithm to utilize bandwidth in its current very efficient manner and make the protocol anonymous?

BitTorrent free-riding

There has been a LOT of talk about how BitTorrent is vulnerable to selfish free-riding. Show empirically that BitTorrent, in its current state, can be defeated or taken advantage of by free-riders. How many free-riders does it take to make the system not worth using for peers that are cooperative and well-behaved?

Peer-to-peer tracing for fun and profit

First file system traces and now p2p traces have proved invaluable in performing realistic research. Throughout the 1990s endless file system papers cited the work of Ousterhout and Baker as evidence to support their Systems research. In the p2p world, much measurement and analysis has been based on the work of Saroiu (MMCN 2002) and Gummadi (SOSP 2003). The Saroiu traces of Gnutella from 2001 has been cited by approximately one hundred published research papers in the past two years, even though their measurement method exhibits a significant flaw (time outs being assumed to be churn). The more recent Gummadi work, which studied Kazaa, will never be released according to the authors because they have been prevented from doing so by University of Washington lawyers.

The Saroiu trace includes per node upstream and downstream bandwidth information, per node joins and leaves for both ping responses and application responses, and number and size of files shared. It has several notable problems: (1) It lacks physical topology information. Thus, researchers who use the trace use topology generators like Georgia Tech-ITM and BRITE to create a topology and glue the two parts together. This leads to incomparable results, even when two groups say they are both using the Gnutella traces. Thus, being able to capture some of this topology information and give researchers a full set of information on which to base their work would be invaluable. (2) As part of their tracing methodology, if a node did not respond to a ping after 20 seconds, it was marked down. This too-short timeout yields a higher churn rate than really occurred. (3) Gnutella itself has changed since 2001; for instance, a similar 3-day trace this year by Blake et al. (HotOS 2003) found 10x the number of nodes, and different network characteristics. Your goal would be to collect and analyze a new and more accurate p2p trace, and publish this for researchers to use.

As a starting point, when Jonathan Ledlie (last year's TF) last spoke with Saroiu, he agreed to give us some part of his software as a starting point. Another research group, at Northwestern, has another tracing tool that might also work as a starting point.

Incentives in p2p

For those with a Parkesian leaning, we put this topic in here as a placeholder for you.

Reputation in p2p networks

Prove or disprove that reputation systems work in the context of p2p networks.

Probabalistic algorithms over DHT p2p networks

Proposed by Francis Crick, Spring 2004

These are just really, really, cool and are very powerful. Just about any problem on a DHT benefits from a little probabalistic activity,