CS 261 Final Project

Project Proposal & Research Plan Due: October 6, 2009

1st Status Meeting: Week of October 13, 2009

2nd Status Meeting: Week of November 13, 2009

Rough Draft Due: 5:00 PM November 24, 2009

In-class Presentations: November 24, 2009

Final Project Due 5:00 PM December 10, 2009

The goal of the final project is to provide the opportunity for you to conduct systems research. The size of the project can vary, but thinking of it as a conference paper is probably a good model. I will hand out a collection of documents that I call "The History of a Paper". (Actually I will try to post them electronically, but they are old and I may have to resort to paper.) It includes an original submission and reviews (the submission was rejected), another submission and reviews (the submission was accepted), and then the final paper. This will give you an idea of how to give and respond to criticism and feedback. It will also give you a sense of what I mean by "conference paper."

Final projects may be undertaken in teams of two (or more) students. Teams of two can form without consulting me. However, if you have a project that you feel warrants more participation, please check with me first. Projects may also be undertaken in cooperation with other graduate courses, but any such project must be approved by the professors of both courses. Not surprisingly, we expect more depth and work for a project that is satisfying two class requirements. Similarly, if you wish to undertake a project related to your own research, I will permit it, but you must demonstrate how what we've learned in CS261 influences your work and/or ways in which your research would have been different had you not been also conducting a project in CS261. In other words, your project in CS261 must extend work you would normally have done in some new and/or different way.

For this project, you need to pose a question, design a framework in which to answer the question, conduct the research, and write up your experience and results. There will be five deliverables for this project.

  1. Project Proposal and Research Plan (20%)
  2. The proposal part should be a single page that describes your project. You should clearly motivate and state the research question you are investigating. Provide a few sentences of explanation about why you think this is an interesting question, why it is important, and how it qualifies as research.

    The research plan is a more comprehensive document. It should include the following components (the numbers in parentheses estimate how many pages you might need for each section).

  3. Status Meeting
  4. I encourage you to come to office hours and talk to me about your project or schedule other meetings with me as the need arises. At a minimum, I want to meet with you twice before the extended abstract is due. One of those meetings must happen before the end of the week of October 13 and the other must happen before the end of the week of November 13. Note that I am traveling October 25-28 and that University accreditation (for which I am the chair) will happen October 19-21, so I may not be as available as I normally would be. These meetings are for your benefit. I expect to answer questions you may have, ask you questions about what you've done, brainstorm about what to do next, etc. If you haven't done anything, you will get little value out of these meetings. Come with questions I can help you answer.

  5. Rough Draft (30%)
  6. This is the version of the paper that we will review at the "Mock Program Committee." You should have all (or at least most) of your research completed by this point. The rough draft should contain all the parts of the paper, although it may have preliminary results and may be missing some results. It should contain a complete introduction, background, description of your research, related work, and (preliminary) conclusions. It should contain the entire structure of the results section, even if there are still some missing results. You should be able to write significant parts of this immediately after your project proposal is turned in, so please, please, please don't write this all the night before it is due. The better and more complete the rough draft, the more valuable the input that I and your classmates can give you.

  7. In-class Presentation (10%)
  8. Each group will present a short talk on their research during class on November 24. (Depending on the number of groups, we might have to schedule additional time.) You should plan on approximately a 10 minute presentation and 2-5 minutes for questions and answers. You can think of the in-class presentation as being a short conference presentation. You will not have time to present all the details and subtleties of your work, but you should be able to motivate the audience and explain the important results of your work. After your presentation, your classmates should want to read your final report. The presentation is a great way to make sure that they understand what you're trying to do, so that there is no confusion when they read/review your project.

  9. Final Report (40%)
  10. The final report is a research paper. I expect that most reports will be approximately 10-15 "conference pages" including graphs, tables, diagrams, and references. You should complete the writing early enough that you have time to reread your work and critique it with the rigor that you applied to Assignment 1. Be honest. State shortcomings in your work. Discuss follow-on projects. I expect that several of these reports will be suitable for submission to a conference (the USENIX Annual Technical Conference deadline is in January), and I will be happy to work with you to turn them into submissions.

    Part of your final report grade will be based upon how well you address comments raised by the program committee. Do not ignore my comments or the reviewers' comments!

Project Suggestions

I suggest some topics below (we may add to this list after it is up on the web site, so it makes sense to check there if you are stuck for project ideas). You need not pick your final project from this list, but if you decide on a project not on this list, please check with me before fully committing to the project. The key characteristics of a project should be:

  1. The work can reasonably be completed in two months.
  2. We have the required hardware and software in-house to enable you to conduct the necessary research.
  3. The research question has something to do with operating systems (I'm willing to give a fair bit of freedom here, but if there are any questions, please check with me).
  4. The project is structured in such a way that you can have tangible results. (No big idea papers probably.)
  5. You will learn something from undertaking this project.

  1. Optimizing space utilization versus reproduction
  2. Imagine that you could recreate any piece of information on your disk. If you could, your system could decide dynamically and adaptively whether to store the actual bytes of an object or merely store the "recipe" for how to reproduce the data. One representation consumes space, but not processing time; the other consumes processing time, but less space.

    Systems that capture provenance, such as our own PASS system, often retain enough information to let you reproduce objects.

    Using PASS as your substrate, design, implement, and evaluate a system that dynamically and adaptively trades off disk space for computation in the manner described here. You will need to augment PASS to store execution times of processes (I think), but once you have that, you can begin developing algorithms to decide when to retain data and when to make it available via recomputation. You will need to figure out exactly how to represent such objects in the file system (that is, they should appear in the file system, but when you open them, you may have to execute one or more processes to create the actual data).
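
    To make the trade-off concrete, here is a minimal sketch of one possible retention policy. It is only an illustration, not part of PASS: the DerivedObject record, its fields, and the seconds_per_byte_day exchange rate are all hypothetical names, and a real policy would probably need to account for things like recomputation chains and inputs that no longer exist.

```python
from dataclasses import dataclass


@dataclass
class DerivedObject:
    """Hypothetical record for a file that provenance says can be regenerated."""
    name: str
    size_bytes: int           # space consumed if the actual bytes are kept
    recompute_seconds: float  # measured execution time needed to regenerate it
    reads_per_day: float      # observed access rate


def should_materialize(obj: DerivedObject, seconds_per_byte_day: float) -> bool:
    """Return True if the bytes should be stored rather than recomputed on open.

    seconds_per_byte_day is an assumed exchange rate: how many seconds of
    computation we are willing to spend per day to avoid storing one byte.
    """
    # Expected compute time spent per day if we store only the recipe.
    recompute_cost = obj.recompute_seconds * obj.reads_per_day
    # Storage cost per day, expressed in the same "seconds" currency.
    storage_cost = obj.size_bytes * seconds_per_byte_day
    return recompute_cost > storage_cost


hot = DerivedObject("hot.dat", size_bytes=1_000_000, recompute_seconds=30.0, reads_per_day=10.0)
cold = DerivedObject("cold.dat", size_bytes=10**9, recompute_seconds=2.0, reads_per_day=0.01)
decisions = {o.name: should_materialize(o, seconds_per_byte_day=1e-6) for o in (hot, cold)}
```

    With these made-up numbers the small, frequently read, expensive object is kept as bytes, while the large, rarely read, cheap object is kept only as a recipe. Choosing and adapting the exchange rate is where the research lies.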

  3. Using Provenance to solve OS Problems
  4. This is a generalization of the last project.

    There are many systems papers of the form, "We wanted to solve some problem, so we modified the kernel to produce a bunch of data, and then we used that data to do something." I'd like to see how many of these projects could be done simply by mining the provenance data that our PASS system collects. (You will want to go read the PASS paper now if this is at all interesting. You can find it at: http://www.eecs.harvard.edu/syrah/pass/pubs/usenix06.pdf.)

    1. For example, prefetching files requires that you know what files are likely to be accessed, before programs actually access them -- PASS captures much of that data. So, see if you can replicate the work in "An Analytical Approach to File Prefetching (1997 USENIX)" using PASS. Here are other papers on file prefetching to examine:
      • Marginal Cost-Benefit Analysis for Predictive File Prefetching (ACSME 2003)
      • Design and Implementation of Predictive File Prefetching (USENIX 2002)
    2. Another area where provenance might be useful is in cache replacement algorithms -- if you knew what you might need again soon, you would keep it in your cache. Look for papers on caching, such as:
      • A study of integrated prefetching and caching strategies (Sigmetrics PER 1995)
      • Informed prefetching and caching (SOSP 1995)
      • Application controlled prefetching and caching (USENIX 2002)
    3. A PhD thesis by Somayaji (from UNM) showed that short sequences of system calls can be used to "fingerprint" applications. By noticing unusual system call sequences, he was able to perform intrusion detection (and correction). Could provenance be used in a similar manner? What kinds of fingerprints could we create/monitor?
    4. The Coda file system was designed to help users work in a disconnected mode. One component of that system was a hoarding mechanism where the system would try to figure out what files you were going to need to function while disconnected. It seems that one could exploit provenance to perform better hoarding. Do it!
    5. Any other piece of work that requires collecting data that we already collect in PASS. Be creative!
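
    For the fingerprinting idea in item 3, a sliding-window profile is one plausible starting point. The sketch below is an assumption-laden illustration, not the algorithm from the thesis and not a PASS interface: it treats a trace as a plain list of event names (system calls here, but they could be provenance events), and the function names and the window length n=3 are arbitrary choices.

```python
def windows(events, n):
    """Return every length-n window of consecutive events, in order."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def build_profile(normal_traces, n=3):
    """Collect the set of windows seen in traces of known-good behavior."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, n))
    return profile


def anomaly_score(trace, profile, n=3):
    """Fraction of the trace's windows that never appeared in the profile."""
    seen = windows(trace, n)
    if not seen:
        return 0.0  # trace is shorter than one window; nothing to judge
    unseen = sum(1 for w in seen if w not in profile)
    return unseen / len(seen)


normal = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
]
profile = build_profile(normal)
familiar = anomaly_score(["open", "read", "close"], profile)
strange = anomaly_score(["open", "write", "exec", "close"], profile)
```

    A trace made only of familiar windows scores 0.0, and one made only of unfamiliar windows scores 1.0. What the "events" should be when the input is provenance rather than system calls is the open question.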

    Warning: I have a strong vested interest in this project. The upside is that you are likely to get lots of attention; the downside is that you are likely to get lots of attention.

    But seriously, I would like to see multiple groups attack different aspects of this problem, and then I'd like to write a BIG paper about all the wonderful uses of provenance. So, if you're looking for a publication, this is the project for you.

  5. Enhancing the User Experience via Thumbnails
  6. Users are frustrated by the current experience of downloading, browsing, and exchanging files in ad hoc, low bandwidth networks, such as wireless networks. One improvement to this experience is to let users first interact with local, lossy versions of files before fetching complete files. These lossy versions, or thumbnails, can be sent quickly over the network to augment traditional metadata such as file names and ownership.

    In low bandwidth environments, thumbnails must be generated at the server. In prior work, an application program running on the server generated thumbnails in response to users' requests. In contrast, you might extend the underlying file system to support direct storage of thumbnails, just as human-readable names and ownership are included today. Making thumbnails first class objects linked directly to their lossless version might enable consistency guarantees and comparison of objects via thumbnails.

    This project would extend prior work by a former student here. Consult the following reference and then design, implement, and evaluate a system that provides the services described therein. Jonathan Ledlie, File System Support for Low-Bandwidth Thumbnails, Nokia Research Technical Report NRC-TR-2008-004, May 2008.

  7. Analyzing Data Distributions
  8. We can gain access to the collection of hash values for every 4 KB block in the National Software Reference Library (NSRL). Using these hash values, you can compute the number of unique hashes, the distribution of duplicate hash values, etc. This seems like a rich source of interesting data. Using this data, the following projects might be interesting.
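
    The basic statistics mentioned above can be computed along the following lines. This is a minimal sketch that assumes the hashes have already been parsed into an iterable of strings, one per 4 KB block; I am not assuming anything about the actual NSRL file format, and the function name and the returned keys are made up for illustration.

```python
from collections import Counter


def duplicate_stats(block_hashes):
    """Summarize how often each block hash occurs.

    block_hashes: iterable of hash strings, one per block (assumed input form).
    """
    counts = Counter(block_hashes)         # hash -> number of blocks with that hash
    total = sum(counts.values())
    unique = len(counts)
    # copies -> number of distinct hashes that occur exactly that many times
    copies_histogram = Counter(counts.values())
    return {
        "total_blocks": total,
        "unique_hashes": unique,
        "copies_histogram": dict(copies_histogram),
        # Average number of blocks per distinct hash.
        "dedup_ratio": total / unique if unique else 0.0,
    }


stats = duplicate_stats(["a", "b", "a", "c", "a", "b"])
```

    For the full data set the hashes will likely not fit comfortably in memory as Python strings, so expect to stream, sort, or sample.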

  9. Cloud Architectures
  10. Users are moving computation into the cloud, but it's not clear how we should decompose services in a cloud-based world. Devise a framework in which to ask the question, "What is the best way to decompose services in a cloud-centric world?" Then conduct experiments to address the question.

  11. Cloud OS API
  12. Today's cloud offerings have been packaged as services. If all computing moves to the cloud, perhaps the cloud ought to offer a broader API, closer to that of an operating system. Develop a framework in which to ask the question, "What is a reasonable set of APIs that cloud providers should export?" Figure out a way to actually evaluate the answer to this question. (This is still quite vague and will require some work to turn into a concrete proposal.)

  13. The Future of Computing
  14. It seems that there are two opposing models for what computing is going to look like in the next decade. One model suggests that we are moving to the cloud. All data and computation will be performed remotely, on large commercially run compute and storage servers (think Amazon Web Services). The other model suggests that you will carry your entire collection of data around with you on a mobile device -- iPhone, BlackBerry, Treo. That device will have significant computational power, but more importantly, it will seamlessly and easily allow you to use any display on which to work.

    Explore the design space for these two models of computing. The first is pretty easily specified today using existing services. The second will require some work -- what research remains to be done? How much of that work could you do this semester? How will you evaluate the two models?

    The outcome of this project ought to be a prediction with evidence to back it up.

  15. How Relevant is System/161 for Systems Research?
  16. Because it's simple inside, System/161 has proven a useful tool for kernel prototyping. However, because of that same simplicity it is difficult to evaluate kernels running on System/161. Evaluate how performance under System/161 relates to performance on real hardware. Absolute performance is less interesting than relative performance: how well do speedups measured under System/161 relate to speedups measured using real hardware? You might want to consider some of the following questions:

  17. Microkernels Reinvented
  18. The ubiquity of multicore processors introduces the possibility that the microkernel architecture, where OS services are provided by a set of specialized cooperating processes, might make sense again. It would be interesting to explore this question. You might take any of several different approaches: Pick one or more questions like this and figure out how to answer them. You might find simulators such as SIMICS useful.

  19. Building an overlay to detect Censorship versus IT Failures
  20. There is an interesting project at the Berkman Center called Herdict, which allows users to notify them when it appears that sites are being blocked. However, users reporting blockages have no way to differentiate technical failures (e.g., Google is down for everyone) from true censorship.

    However, imagine that you had a set of machines distributed around the world and when someone reported a blocked site, you could probe the site from a variety of sources around the world in an attempt to differentiate technical failures from true censorship.

    Using PlanetLab (a worldwide network of machines available for research), design, build, and evaluate an overlay for this purpose. Some of the interesting questions are how you decide which nodes should probe a potentially blocked site and how you interpret your results. Ideally the actual construction of the overlay shouldn't require significant research, but should be an educational and entertaining experience.
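
    On the interpretation question, a naive first cut might look like the sketch below. Everything in it is an assumption for illustration: the (region, reachable) probe format, the function name, the placeholder region codes, and the 0.2 and 0.8 thresholds are not taken from Herdict or PlanetLab. Part of the project is deciding what should replace them.

```python
def classify(report_region, probes, low=0.2, high=0.8):
    """Guess why a site looked blocked to a user in report_region.

    probes: list of (region, reachable) pairs gathered from overlay nodes.
    low/high: arbitrary illustrative thresholds on the fraction of successes.
    """
    inside = [ok for region, ok in probes if region == report_region]
    outside = [ok for region, ok in probes if region != report_region]
    if not outside:
        return "insufficient data"
    outside_rate = sum(outside) / len(outside)
    if outside_rate < low:
        # Almost nobody anywhere can reach the site.
        return "likely site failure"
    # With no inside probes, fall back on the user's own report of failure.
    inside_rate = sum(inside) / len(inside) if inside else 0.0
    if inside_rate < low and outside_rate > high:
        return "possible censorship"
    return "inconclusive"


blocked = classify("AA", [("BB", True), ("CC", True), ("AA", False)])
down = classify("AA", [("BB", False), ("CC", False), ("AA", False)])
mixed = classify("AA", [("BB", True), ("CC", False), ("AA", False)])
```

    Here "AA", "BB", and "CC" are placeholder region codes. The three calls yield "possible censorship", "likely site failure", and "inconclusive" respectively, which shows how quickly a simple rule runs out of answers once the evidence is mixed.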

  21. Make Impressions Really Work
  22. Most of you who used the Impressions paper for your first assignment really liked the paper and were extraordinarily disappointed with the tool itself. Let's fix that (I agree that it would be a useful tool).

    In particular, let's see if we can get the constraint problem working so you create file system images of the right sizes. Also, I'd like to see if there is a way that we could allow different kinds of distributions for the various parameters rather than assuming that *all* file systems follow the same kinds of distributions as the Microsoft ones. This would require analyzing a bunch of file system disk images and creating distributions for them and then seeing what the different distributions we want to support are. Finally, augmenting the tool so that it would take data from a real file system and spit out an Impressions parameter file (well documented) would be incredibly useful. In other words, let's take all your criticisms and use them to generate a really, really cool tool.