CS 261 Final Project
Project Proposal & Research Plan Due: October 6, 2009
1st Status Meeting: Week of October 13, 2009
2nd Status Meeting: Week of November 13, 2009
Rough Draft Due: 5:00 PM November 24, 2009
In-class Presentations: November 24, 2009
Final Project Due 5:00 PM December 10, 2009
The goal of the final project is to provide the opportunity for you
to conduct systems research. The size of the project can
vary, but thinking of it as a conference paper is probably a good
model.
I will hand out a collection of documents that I call "The History
of a Paper." (Actually, I will try to post them electronically, but
they are old and I may have to resort to paper.)
It includes an original submission and reviews (the submission was rejected),
another submission and reviews (the submission was accepted),
and then the final paper.
This will give you an idea of how to give and respond to criticism and
feedback.
It will also give you a sense of what I mean by "conference paper."
Final projects may be undertaken in teams of two (or more) students.
Teams of two can form without any consultation from me.
However, if you have a project that you
feel warrants more participation, please check with me first.
Projects may also be undertaken in cooperation with other graduate
courses, but any such project must be approved by the professors
of both courses. Not surprisingly, we expect more depth and work
for a project that is satisfying two class requirements.
Similarly, if you wish to undertake a project related to your own
research, I will permit it, but you must demonstrate how what we've
learned in CS261 influences your work and/or ways in which your research
would have been different had you not been also conducting a project
in CS261. In other words, your project in CS261 must extend work you
would normally have done in some new and/or different way.
For this project, you need to pose a question, design a framework
in which to answer the question, conduct the research, and write
up your experience and results. There will be five deliverables for
this project.
- Optimizing space utilization versus reproduction
Imagine that you could recreate any piece of information on your
disk.
If you could, your system could decide dynamically and adaptively whether
to store the actual bytes of an object or merely store the "recipe"
for how to reproduce the data.
One representation consumes space, but not processing time; the
other consumes processing time, but less space.
Systems that capture provenance, such as our own PASS system,
often retain enough information to let you reproduce objects.
Using PASS as your substrate, design, implement, and evaluate a
system that dynamically and adaptively trades off disk space
for computation in the manner described here. You will need to
augment PASS to store execution times of processes (I think), but
once you have that, you can begin developing algorithms to decide
when to retain data and when to make it available via recomputation.
You will need to figure out exactly how to represent such objects
in the file system (that is, they should appear in the file system,
but when you open them, you may have to execute one or more processes
to create the actual data).
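One way to frame the retain-or-recompute decision is as a cost comparison. The sketch below is only an illustration and is not part of PASS: the DerivedObject fields, the two cost weights, and the should_materialize function are hypothetical names, and the linear cost model is an assumption that a real project would need to validate.

```python
from dataclasses import dataclass


@dataclass
class DerivedObject:
    """Hypothetical record for a file that can be regenerated from provenance."""
    size_bytes: int            # space consumed if the bytes are kept on disk
    recompute_seconds: float   # measured time to regenerate the data
    reads_per_day: float       # observed (or predicted) access rate


def should_materialize(obj: DerivedObject,
                       space_cost_per_byte_day: float,
                       cpu_cost_per_second: float) -> bool:
    """Return True if keeping the bytes is no more costly than recomputing.

    Assumed daily cost model (an illustration, not a measured one):
      storing     = size_bytes * space_cost_per_byte_day
      recomputing = reads_per_day * recompute_seconds * cpu_cost_per_second
    """
    storage_cost = obj.size_bytes * space_cost_per_byte_day
    recompute_cost = (obj.reads_per_day * obj.recompute_seconds
                      * cpu_cost_per_second)
    return storage_cost <= recompute_cost


# Made-up example objects and weights, chosen only to exercise both outcomes.
big_rarely_read = DerivedObject(size_bytes=10**9, recompute_seconds=2.0,
                                reads_per_day=0.1)
small_hot = DerivedObject(size_bytes=4096, recompute_seconds=30.0,
                          reads_per_day=50.0)

for name, obj in [("big_rarely_read", big_rarely_read),
                  ("small_hot", small_hot)]:
    keep = should_materialize(obj, space_cost_per_byte_day=1e-9,
                              cpu_cost_per_second=0.01)
    print(name, "-> store bytes" if keep else "-> store recipe only")
```

An adaptive system would re-evaluate this decision as access rates and free space change; how often to do so, and how to estimate the inputs, is part of the research question.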
- Using Provenance to solve OS Problems
This is a generalization of the last project.
There are many systems papers of the form, "We wanted to solve
some problem, so we modified the kernel to produce a bunch of data,
and then we used that data to do something." I'd like to
see how many of these projects could be done simply by mining the
provenance data that our PASS system collects. (You will want to
go read the PASS paper now, if this is at all interesting. You can
find it at: http://www.eecs.harvard.edu/syrah/pass/pubs/usenix06.pdf.)
- For example, prefetching files requires that you know what files
are likely to be accessed, before programs actually access them --
PASS captures much of that data. So, see if you can replicate the work in
"An Analytical Approach to File Prefetching (1997 USENIX)"
using PASS. Here are other papers on file prefetching to examine:
- Marginal Cost-Benefit Analysis for Predictive File Prefetching (ACMSE 2003)
- Design and Implementation of Predictive File Prefetching (USENIX 2002)
- Another area where provenance might be useful is in cache replacement
algorithms -- if you knew what you might need again soon, you would keep
it in your cache. Look for papers on caching, such as:
- A study of integrated prefetching and caching strategies (Sigmetrics PER 1995)
- Informed prefetching and caching (SOSP 1995)
- Application controlled prefetching and caching (USENIX 2002)
- A PhD thesis by Somayaji (from UNM) showed that short sequences of
system calls can be used to "fingerprint" applications.
By noticing unusual system call sequences, he was able to perform intrusion
detection (and correction). Could provenance be used in a similar manner?
What kinds of fingerprints could we create/monitor?
- The Coda file system was designed to help users work in a disconnected
mode. One component of that system was a hoarding mechanism where the
system would try to figure out what files you were going to need
to function while disconnected. It seems that one could exploit provenance
to perform better hoarding. Do it!
- Any other piece of work that requires collecting data that we
already collect in PASS. Be creative!
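To make the prefetching idea concrete, here is a minimal sketch of a first-order successor predictor driven by a file access trace. It is not the algorithm from any of the papers listed above, and the trace format (a plain list of path names in access order) is an assumption; extracting such a trace from PASS provenance records is left as part of the project.

```python
from collections import Counter, defaultdict


class SuccessorPredictor:
    """Predict the next file from counts of which file followed which."""

    def __init__(self):
        # successors[a][b] = number of times an access to b followed a
        self.successors = defaultdict(Counter)
        self.previous = None

    def observe(self, path):
        """Record one file access from the trace."""
        if self.previous is not None:
            self.successors[self.previous][path] += 1
        self.previous = path

    def predict(self, path):
        """Return the most frequent successor of path, or None if unseen."""
        counts = self.successors.get(path)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


# Invented trace, for illustration only.
trace = ["Makefile", "main.c", "util.h",
         "Makefile", "main.c", "util.h",
         "Makefile", "notes.txt"]

predictor = SuccessorPredictor()
for path in trace:
    predictor.observe(path)

print(predictor.predict("Makefile"))   # candidate to prefetch after Makefile
print(predictor.predict("never-seen"))
```

A real evaluation would replay a held-out trace and measure how often the predicted file is actually the next one accessed.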
Warning: I have a strong vested interest in this project. The upside is that
you are likely to get lots of attention; the downside is that you are likely
to get lots of attention.
But seriously, I would like to see multiple groups attack different aspects
of this problem, and then I'd like to write a BIG paper about all the
wonderful uses of provenance. So, if you're looking for a publication,
this is the project for you.
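The fingerprinting idea in the list above can be sketched as a sliding-window comparison: record the short event sequences seen during normal runs, then measure how many windows in a new run were never seen before. This is a simplification and is not taken from the thesis; the event names are invented, and what a provenance "event" should be is exactly the open question.

```python
def windows(events, n):
    """Return the list of length-n sliding windows over events."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def build_profile(normal_runs, n):
    """Collect every length-n window observed across the normal runs."""
    profile = set()
    for run in normal_runs:
        profile.update(windows(run, n))
    return profile


def anomaly_rate(profile, events, n):
    """Fraction of windows in events that do not appear in the profile."""
    observed = windows(events, n)
    if not observed:
        return 0.0
    unseen = sum(1 for w in observed if w not in profile)
    return unseen / len(observed)


# Invented event sequences, for illustration only.
normal = [["open", "read", "read", "close", "open", "read", "close"]]
profile = build_profile(normal, n=3)

familiar = ["open", "read", "close", "open", "read", "read", "close"]
unusual = ["open", "read", "exec", "connect", "write", "close"]

print(anomaly_rate(profile, familiar, n=3))
print(anomaly_rate(profile, unusual, n=3))
```

The window length and the threshold at which a rate counts as "unusual" are free parameters that would need to be chosen experimentally.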
- Enhancing the User Experience via Thumbnails
Users are frustrated by the current experience of downloading,
browsing, and exchanging files in ad hoc, low bandwidth networks, such
as wireless networks. One improvement to this experience is to let
users first interact with local, lossy versions of files before
fetching complete files. These lossy versions, or *thumbnails*,
can be sent quickly over the network to augment traditional metadata
such as file names and ownership.
In low bandwidth environments, thumbnails must be generated at the
server. In prior work, an application program running on the server
generated thumbnails in response to users' requests. In contrast, you
might extend the underlying file system to support direct storage of
thumbnails, just as human-readable names and ownership are included
today. Making thumbnails first class objects
linked directly to their lossless version might enable consistency
guarantees and comparison of objects via thumbnails.
This project would extend prior work by a former student here.
Consult the following reference and then design, implement, and
evaluate a system that provides the services described therein.
Jonathan Ledlie,
File System Support for Low-Bandwidth Thumbnails,
Nokia Research Technical Report NRC-TR-2008-004, May 2008.
- Analyzing Data Distributions
We can gain access to the collection of hash values for every 4 KB block
in the National Software Reference Library (NSRL).
Using these hash values, you can compute the number of unique hashes,
the distribution of duplicate hash values, etc.
This seems like a rich source of interesting data.
Using this data, the following projects might be interesting.
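As an illustration of the basic statistics mentioned above (unique hashes and the distribution of duplicates), here is a minimal sketch. It assumes the block hashes have already been loaded as an iterable of strings; the actual format in which the NSRL hashes would be provided is not specified here, and the sample values are invented.

```python
from collections import Counter


def summarize_block_hashes(hashes):
    """Summarize duplication in a collection of per-block hash values."""
    occurrences = Counter(hashes)            # hash -> number of blocks with it
    total_blocks = sum(occurrences.values())
    unique_hashes = len(occurrences)
    # multiplicity[k] = number of distinct hashes that occur exactly k times
    multiplicity = Counter(occurrences.values())
    dedup_ratio = total_blocks / unique_hashes if unique_hashes else 0.0
    return {
        "total_blocks": total_blocks,
        "unique_hashes": unique_hashes,
        "multiplicity": dict(multiplicity),
        "dedup_ratio": dedup_ratio,
    }


# Invented sample: six blocks, three distinct hash values.
sample = ["a1", "b2", "a1", "c3", "a1", "b2"]
summary = summarize_block_hashes(sample)
print(summary)
```

For the full data set, the Counter may not fit in memory; sorting the hashes on disk and counting runs of equal values is one alternative.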
- Cloud Architectures
Users are moving computation into the cloud, but it's not clear how
we should decompose services in a cloud-based world.
Devise a framework in which to ask the question, "What is the best
way to decompose services in a cloud-centric world?"
Then conduct experiments to address the question.
- Cloud OS API
Today's cloud offerings have been packaged as services.
If all computing moves to the cloud, perhaps the cloud ought to offer
a broader API, closer to that of an operating system.
Develop a framework in which to ask the question, "What is a reasonable
set of APIs that cloud providers should export?"
Figure out a way to actually evaluate the answer to this
question.
(This is still quite vague and will require some work to turn into
a concrete proposal.)
- The Future of Computing
It seems that there are two opposing models for what computing is
going to look like in the next decade.
One model suggests that we are moving to the cloud.
All data and computation will be performed remotely, on large
commercially run compute and storage servers (think Amazon Web
Services).
The other model suggests that you will carry your entire collection
of data around with you on a mobile device -- iPhone, BlackBerry, Treo.
That device will have significant computational power, but more importantly,
it will seamlessly and easily allow you to use any display on which
to work.
Explore the design space for these two models of computing.
The first is pretty easily specified today using existing services.
The second will require some work -- what research remains to be done?
How much of that work could you do this semester? How will you evaluate
the two models?
The outcome of this project ought to be a prediction with evidence to
back it up.
- How Relevant is System/161 for Systems Research?
Because it's simple inside, System/161 has proven a useful tool for
kernel prototyping. However, because of that same simplicity it
is difficult to evaluate kernels running on System/161. Evaluate
how performance under System/161 relates to performance on real
hardware. Absolute performance is less interesting than relative
performance: how well do speedups measured under System/161 relate to
speedups measured using real hardware? You might want to consider some
of the following questions:
- Is the lack of DMA support in System/161's system bus a major problem?
- Would it be worthwhile to switch to a better disk model
(e.g. CMU disksim)?
- Would it be worthwhile to add a processor cache modeler?
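The comparison of relative performance described above can be sketched as follows. The function and dictionary names are hypothetical, and the timings are made-up placeholders, not measurements; the point is only to show the quantity being compared.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Speedup of an optimized kernel over its baseline (larger is better)."""
    return baseline_seconds / optimized_seconds


def compare_speedups(simulated, real):
    """Compare speedups measured under the simulator and on real hardware.

    Both arguments map an experiment name to (baseline_s, optimized_s).
    Returns name -> (simulated speedup, hardware speedup, relative error).
    """
    report = {}
    for name in sorted(simulated.keys() & real.keys()):
        sim = speedup(*simulated[name])
        hw = speedup(*real[name])
        report[name] = (sim, hw, (sim - hw) / hw)
    return report


# Placeholder timings in seconds, invented for illustration only.
simulated = {"change_a": (12.0, 6.0), "change_b": (9.0, 6.0)}
real = {"change_a": (10.0, 8.0), "change_b": (6.0, 4.0)}

report = compare_speedups(simulated, real)
for name, (sim, hw, err) in report.items():
    print(f"{name}: simulated {sim:.2f}x, hardware {hw:.2f}x, "
          f"relative error {err:+.0%}")
```

A relative error near zero across many experiments would suggest that the simulator predicts speedups well even if absolute times differ.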
- Microkernels Reinvented
The ubiquity of multicore processors introduces the possibility that
the microkernel architecture, where OS services are provided by a set
of specialized cooperating processes, might make sense again.
It would be interesting to explore this question.
You might take any of several different approaches:
- If we used a microkernel, could we use smaller cores and tiny
processes (e.g. use a 16-bit architecture that would save power
and facilitate many more cores on a chip)?
- Can we get better parallelism out of a microkernel than a
monolithic kernel on a chip with multiple thread contexts?
Does this perform better than some of the specialized OS
structures being proposed?
- Would a microkernel architecture be able to do a better job
of addressing the memory bottleneck?
Pick one or more questions like these and figure out how to answer them.
You might find simulators such as SIMICS useful.
- Building an overlay to detect Censorship versus IT Failures
There is an interesting project at the Berkman Center called
Herdict that allows users to
notify the project when it appears that sites are being blocked.
However, users reporting blockages have no way to differentiate
technical failures (e.g., Google is down for everyone) from
true censorship.
Imagine instead that you had a set of machines distributed around the world
and when someone reported a blocked site, you could probe the site from
a variety of sources around the world in an attempt to differentiate
technical failures from true censorship.
Using PlanetLab (a worldwide network of machines available for research),
design, build, and evaluate an overlay for this purpose. Some of the
interesting questions are how you decide which nodes should probe a
potentially blocked site and how you interpret your results. Ideally
the actual construction of the overlay shouldn't require significant
research, but should be an educational and entertaining experience.
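The interpretation step might start from something like the sketch below. It performs no real probing and uses no PlanetLab API: probe outcomes are supplied as data, the node names and country codes are placeholders, and the 0.8 threshold and the three labels are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    node: str        # hypothetical vantage-point name
    country: str     # placeholder code for where the node is located
    reachable: bool  # whether the probe from this node reached the site


def classify_report(reporter_country: str,
                    probes: List[ProbeResult],
                    outage_threshold: float = 0.8) -> str:
    """Guess why a site was reported as blocked, from worldwide probes."""
    if not probes:
        return "inconclusive"
    failed = sum(1 for p in probes if not p.reachable)
    if failed / len(probes) >= outage_threshold:
        # Almost nobody can reach the site, so it is probably just down.
        return "likely technical failure"
    local = [p for p in probes if p.country == reporter_country]
    remote = [p for p in probes if p.country != reporter_country]
    local_blocked = bool(local) and all(not p.reachable for p in local)
    remote_ok = bool(remote) and all(p.reachable for p in remote)
    if local_blocked and remote_ok:
        return "possible censorship"
    return "inconclusive"


# Invented probe outcomes for a report filed from placeholder country "AA".
probes = [
    ProbeResult("node1", "AA", False),
    ProbeResult("node2", "BB", True),
    ProbeResult("node3", "CC", True),
    ProbeResult("node4", "DD", True),
]
print(classify_report("AA", probes))
```

A real design would also have to handle partial outages, regional routing problems, and unreliable probe nodes, which is where the interesting questions lie.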
- Make Impressions Really Work
Most of you who used the Impressions paper for your first assignment really liked
the paper and were extraordinarily disappointed with the tool itself. Let's fix
that (I agree that it would be a useful tool).
In particular, let's see if we can get the constraint problem working so that you
can create file system images of the right sizes. Also, I'd like to see if there
is a way that we could allow different kinds of distributions for the various
parameters rather than assuming that *all* file systems follow the same kinds
of distributions as the Microsoft ones. This would require analyzing a bunch
of file system disk images and creating distributions for them and then seeing
what the different distributions we want to support are. Finally, augmenting
the tool so that it would take data from a real file system and spit out an
Impressions parameter file (well documented) would be incredibly useful.
In other words, let's take all your criticisms and use them to generate a
really, really cool tool.
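As a starting point for deriving distribution parameters from a real file system, the sketch below walks a directory tree and fits a log-normal distribution to the file sizes. The log-normal model is an assumption for illustration (whether it fits a given file system is part of what the project would test), the function names are hypothetical, and the sketch does not read or write the Impressions parameter file format.

```python
import math
import os


def collect_file_sizes(root):
    """Return the sizes in bytes of all regular files under root."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and not os.path.islink(path):
                    sizes.append(os.path.getsize(path))
            except OSError:
                # Skip files that vanish or cannot be examined.
                continue
    return sizes


def fit_lognormal(file_sizes):
    """Estimate (mu, sigma) of ln(size), ignoring zero-length files."""
    logs = [math.log(s) for s in file_sizes if s > 0]
    if len(logs) < 2:
        raise ValueError("need at least two non-empty files to fit")
    mu = sum(logs) / len(logs)
    variance = sum((x - mu) ** 2 for x in logs) / (len(logs) - 1)
    return mu, math.sqrt(variance)


# Tiny synthetic input, for illustration only.
mu, sigma = fit_lognormal([math.e, math.e ** 3, 0])
print(f"mu={mu:.3f} sigma={sigma:.3f}")
```

In practice you would call fit_lognormal(collect_file_sizes(some_directory)) on several different file systems and compare the fitted parameters, and the quality of the fit, across them.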