CS 264. Peer-to-Peer
Systems: Project Ideas
Bandwidth measurement on PlanetLab
In the SBON paper we encountered the notion of a latency space as
calculated by the Vivaldi algorithm where each node successively
refines its coordinates in the latency space through periodic
measurements to other random nodes in the overlay. Is it possible to
do something similar for a bandwidth space? Applications that run
over a large testbed such as PlanetLab often need accurate estimates
of the bandwidth between nodes. Using N squared measurements (to
measure bw between all pairs of nodes) is expensive. Can we predict
bandwidth based on a few measurements to a few other random nodes?
This project would involve surveying and understanding existing
bandwidth measurement techniques, using these techniques to get a good
data set of bandwidth measurements on PlanetLab, analyzing the data
and possibly proposing a method (for example, a bandwidth coordinate
space) that allows nodes participating in an overlay like PlanetLab to
predict the bandwidth to another node in the overlay based on a few
measurments done to other nodes in the coordinate space.
Dynamic Network Coordinates
In looking at latency space, how often do network coordinates change?
Coordinates that change back and forth between two nodes might
indicate node instability, an underlying BGP route flap, or even a bug
in your measurement algorithm. This project involves studying the
dynamism in network coordinates over PlanetLab. You would measure a
few weeks worth of network coordinates on PlanetLab, track the
movement of coordinates in the system, and implement ways to allow a
user to visualize the data. For example, if you need to visualize how
3-D latency coordinate space evolves over time (that is, a 4-D data
set), you could graph the 3-D latency coordinates as a movie and show
the time component through movement of the 3-D coordinate space. If
you have interest in graphics, data visualization, and/or statistics,
this is the project for you. You can use the Vivaldi simulator to get
started.
Real-time news filtering over an SBON
Use the SBON to implement and deploy on PlanetLab or on the SYRAH
group blades a real-time news filtering application. The application
would fetch news feeds in realtime and process/filter the news feeds
as directed by a user query (e.g., the user may be interested in news
feeds that match a set of keywords). The project involves building
operators to fetch, filter, and potentially store news feeds at
overlay nodes, building a web interface to allow users to access the
data returned, and potentially looking at how to re-use operators for
multiple simultaneously running queries on the overlay, (i.e., study
how to efficiently identify and leverage overlap between filters
needed for different queries). For this project as well as the projects
listed above, we have some code and/or in-house SBON experts to help
you get started. :-)
Exploring Space-filling Curves
In a large distributed system, finding a node at or near a specific
location can be hard -- especially if you want to do it efficiently.
Several research projects (including the
Hourglass
project at Harvard) have begun approaching this problem in the
following way: a structured overlay (or DHT) provides a
one-dimensional space; let us imagine that locations (coordinates)
were one-dimensional and also let us assume that each node knows its
own location; nodes can "post" their locations into the DHT; a query
for a node at or near a coordinate would perform an "expanding" query
centered at this spot on the DHT; walking away from this spot in both
directions along the ring, it would soon find the nearest node, and
voila, we are done. Now, how can this be achieved if we drop the
assumption that coordinates are one-dimensional? Answer: use a
space-filling curve to map multi-dimensional coordinates to
one-dimensional coordinates with the hope that points that are nearby
each other in one space are nearby each other in the other. The
Hourglass group currently uses a Hilbert curve to do this and has
found it works reasonably well.
There are at least two venues for exploration in this project:
-
Of the four projects using space-filling curves for similar
purposes (see references below), all are using Hilbert curves.
Even though many other flavors of space-filling curves exist and
some may well surpass Hilbert's proximity properties, there does
not exist a detailed analytic or experimental comparison in the
literature. The goal of this project would be to perform such a
comparison.
-
While we noted above that these one-dimensional coordinates could
be stored in a DHT and queried using an expanding search, such an
infrastructure does not exist. The goal of this project would be
to take an existing DHT, Bamboo, and add on this search capability
and evaluate its effectiveness through a large-scale deployment on
PlanetLab (most likely using BAM).
References:
Evaluation: What makes a experimenting with a particular network testbed valid?
There exist a plethora of testbeds for performing experiments for p2p
and other networking protocols. A standard of the networking
community for years has been
ns2, but more recently
Narses,
Emulab/Netbed,
PlanetLab, and finally
Modelnet have been introduced.
What are the advantages and disadvantages of each and how faithfully
do results from one carry over to results on another (and into real
life!!)? Ideally, this project would consist of sets of micro- and
macrobenchmarks that would expose the intended (and unintended)
consequences of the testbeds' assumptions and operating environment.
For example, narses juxtaposes itself as a scalable
alternative to ns2, but what and how much does it lose in this
attempt to scale. Seeing that it simply needs more nodes on which to
run interesting experiments, Emulab/Netbed has recently introduced
"virtual nodes," which allow a user to divide up the physical
computers it provides as part of its experimentation environment. How
does the traffic shaping they provide compare to that of ns2 and
Narses? Finally, experimentation on PlanetLab is notorious
unreproducable; does that make it inherently invalid or simply
realistic?
Pick a topology, any topology!
Mercator, transit-stub, power-law and, the ever popular, random! As
readers, we are presented results with only a hint of reason as to why
a particular topology has been chosen for a set of experiments. Of
course, when we question the first topology, there are always one or
two others, similarly pulled out of thin air, on which author's system
performs well. One need only look as far as the Pastry and CAN papers
to see how little information we are given about the choice of a
particular topology. Too frequently we are offered the argument
"that's how it was done in the related work" when the related work
hasn't explained the choice of topology either. Existing topology
generators include GT-ITM
(transit-stub), BRITE, Inet, and using
network measurements from RON, PlanetLab, Caida. The goal of this project
would be to synthesize what exists in the world of topology generators
and present it in a format understandable, meaningful, and useful to
the p2p community. The SYRAH group would ask
you to present your work at one of our meetings at the end of the
semester, and your work would have the potential to be cited in a wide
variety of System, networking, and p2p contexts.
Survey of P2P and Ad Hoc Routing Protocols (done in Spring 2005)
Schollmeier has noted that there is a good deal of overlap between the routing
protocols that have been proposed in the p2p and ad hoc routing
communities, but, because these communities have their own journals and
conferences, few are aware of the overlap and even fewer have
benefitted from the potential for cross-pollinating research. The goal
of this project would be to explore the similarities and differences
in the various routing protocols that have been proposed and
implemented. Are the different protocols more concerned with scale,
fault-tolerance, trust, churn, memory usage and other state, etc.?
What potential exists for cross-pollination? For this project, you
could choose to do an in-depth survey with minimal implementation, a
theoretical analysis of a spectrum of protocols, a simulation of a set
of them, or an implementation of a few of your choice.
Neighbor/Finger Selection
There has been some work in the structured overlay/DHT world examining
the impact of choosing one's neighbors out of a potential spectrum.
The base Pastry algorithm incorporates the idea of choosing one's most
proximate neighbor out of the range of potential entries, thereby
reducing stretch. For example, if node p's ID is 1xxx and b=16, the
entry for column 0 can be filled with any node whose ID begins with
0. On average, this is 1/16th of all nodes in the system. This means
that there exist a broad spectrum of node to choose for this one
spot. Of course, the same logic (and choice) applies to all of the
other slots in the first row of p's routing table (except row 1, where
it doesn't need an entry). The same kind of logic has been applied to
the Chord finger table, allowing a Chord node to choose a neighbor
from a wide spectrum, improving some characteristic like latency while
keeping the same number of expected hops (see LPRS-Chord in Sigmetrics
2003).
Previous research has examined choosing the node with the lowest
latency for a given spot and this yields good improvements in
latency. What if other characteristics were added to this
consideration, aside from just latency? Would a more stable system
develop? For example, what if one choose the most available and high
bandwidth nodes, those that could support the addtional in-degree?
Issues like these have not been previously explored in the p2p context
and could lead to some strong results.
New routing algorithms
We don't have a lot to say on this topic, but if you have a really new
algorithm and you want to explore it in the context of this class,
that would certainly work as a project. This could be done as a
theoretical analysis, through simulation, or experimentation, or, even
better, as a solid combination of these, depending on your bent. If
you are thinking of doing a project like this, definitely look at the
Gummadi paper " The Impact of DHT Routing on Resilience and Proximity"
first. A similar type of comparison paper would certainly be
interesting and valid if it explored different axes than the Gummadi
paper.
Anonymization in Bit Torrent
As a New York Times
article
on
BitTorrent pointed out, users of BitTorrent are currently not anonymous. That is, it is possible to see
who is sending a file to whom. Can you change the BitTorrent
algorithm to utilize bandwidth in its current very
efficient manner and make the protocol anonymous?
BitTorrent free-riding
There has been a LOT of talk about how BitTorrent is vulnerable to selfish
free-riding. Show empirically that BitTorrent, in its current state, can
be defeated or taken advantage of by free-riders. How many free-riders does
it take to make the system not worth using for peers that are cooperative
and well-behaved?
Peer-to-peer tracing for fun and profit
First file system traces and now p2p traces have proved invaluable in
performing realistic research. Throughout the 1990s endless file
system papers cited the work of Ousterhout and Baker as evidence to
support their Systems research. In the p2p world, much measurement
and analysis has been based on the work of Saroiu (MMCN
2002) and Gummadi (SOSP
2003).
The Saroiu traces of Gnutella from 2001 has been cited by approximately
one hundred published research papers in the past two years,
even though their measurement method exhibits a
significant flaw (time outs being assumed to be churn). The more
recent Gummadi work, which studied Kazaa, will never be released
according to the authors because they have been prevented from doing
so by University of Washington lawyers.
The Saroiu trace
includes
per node upstream and downstream bandwidth information,
per node joins and leaves for both ping responses and application responses,
and number and size of files shared.
It has several notable problems:
(1) It lacks physical topology information. Thus, researchers who use the
trace use topology generators like Georgia Tech-ITM and BRITE to
create a
topology and glue the two parts together. This leads to incomparable
results, even when two groups say they are both using the Gnutella
traces. Thus, being able to capture some of this topology information
and give researchers a full set of information on which to base their
work would be invaluable.
(2) As part of their tracing methodology, if a node did not respond to a
ping
after 20 seconds, it was marked down. This too-short timeout yields a
higher churn rate than really occurred.
(3) Gnutella itself has changed since 2001; for instance, a similar 3-day
trace this year by Blake et al. (HotOS 2003) found 10x the number of
nodes, and different network characteristics.
Your goal would be to collect and analyze a new and more accurate p2p
trace, and publish this for researchers to use.
As a starting point, when Jonathan Ledlie (last year's TF) last spoke with Saroiu, he
agreed to give us some part of his software as a starting point.
Another research group, at Northwestern,
has another tracing tool that might also work as a starting point.
Incentives in p2p
For those with a Parkesian leaning, we put this topic in here as a
placeholder for you.
Reputation in p2p networks
Prove or disprove that reputation systems work in the context of p2p networks.
Probabalistic algorithms over DHT p2p networks
Proposed by Francis Crick, Spring 2004
These are just really, really, cool and are very powerful. Just about
any problem on a DHT benefits from a little probabalistic activity,