I. Multiple Choice (42)

1. C
2. A, D, F
3. D
4. A, B, C, D
5. None -- all run in user mode in a virtual machine environment
6. A
7. A, D, E
8. A, B, C
9. C
10. B, D

=-=-=-=-=-

II. Short Answer (30)

1. Wait channels have spinlocks instead of sleep locks. Wait channels are also managed: a single structure holds all of the wait channels.
2. If you inadvertently attach an uninitialized block to a file, you might reveal confidential data to the new file's owner.
3. If you let a VM change the hardware mappings directly, it could peek at other VMs' memory.
4. Because it's dependent on the trap frame, which is determined by the hardware registers.
5. HW: could be faster. SW: gives more flexibility.
6. Page coloring -- placing things in the address space to reduce thrashing. The idea is that you assign colors to addresses that share something important, in this case TLB entries. Then you allocate things likely to be used together to different colors.

=-=-=-=-=-

III. Stupid or Clever (50)

1. Stupid (doesn't work)
2. Clever (lots of different processors around)
3. Clever (kind of obvious, but a nice idea)
4. Stupid (now you have to figure out how to commit across logs)
5. One could argue either way: with buffering, it's not stupid, but it might not be clever; without buffering it would be stupid.

=-=-=-=-=-

IV. Quantum Drive (38)

1. (5 points) 100. 1 TB = 1,000,000 1 MB writes. Sequentially, you can do 10,000 1 MB writes in a second, so you need 100 such streams to get to a million.

2. (9 points; 3 per challenge)
* The destructive reads are a real problem -- this means you MUST replicate. If you replicate each object twice, then whenever you read something you must immediately write it back. I'm not sure 3-way replication helps much.
* The disk works best when you can leverage parallelism and the writes are big, so we really want to transfer things to the drive in BIG BIG BIG units.
* Related to the second point above, small writes are going to be very, very bad, so we probably want to avoid them at all costs.

3. (4 points) Inode design

Design rationale: The minimum 1 MB transfer size suggests either an LFS-like design OR an extent-based design with large extents. Let's see -- an exabyte is 2^40 MB, or roughly a trillion extents. If we assume 3-way replication, that is still O(trillion) extents. So I think you take files < 1 MB and pack them together into segments, and you take files > 1 MB and represent them with 1 MB extents.

Now, we also need to replicate inodes, because you destroy them when you read them. So here is what I'll do:

For small files, do what Cedar did and write: a bunch of data blocks, ordered and sorted by file, followed by a bunch of inodes, followed by a summary. Call that a segment and write it three times (striped across parallel units).

For large files, we make the inode as large as it needs to be to hold all the extents. We write the extents (3 times each) and then pack these large-file inodes into segments and write them too. And we write segment summaries too.

Now, we have variable-sized inodes, so they will store their size as a first element, and we'll keep a map recording each inode's size and its multiple locations. This map gets written 3 times on every checkpoint as well.

In summary:
Inodes are variable size.
Files smaller than 1 MB have a single extent.
Larger files have as many 1 MB extents as necessary.
Inodes are replicated 3 times.
Inodes get packed into segments.
Inode locations are tracked with a triply replicated map.
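To make the inode design above concrete, here is a minimal C sketch of what the variable-sized inode and the replicated inode-map entry might look like. Every name and field width (qd_inode, qd_imap_entry, NREPLICAS, and so on) is an illustrative assumption; the answer above does not commit to an exact on-disk layout.

    /*
     * Hypothetical sketch of the structures described above.
     * Names and field widths are assumptions for illustration only.
     */
    #include <stdint.h>

    #define NREPLICAS 3            /* data and inodes are written 3 times */

    /* A 1 MB extent, recorded once per replica so a destructive read of
     * one copy can be repaired from the others. */
    struct qd_extent {
        uint64_t replica_addr[NREPLICAS];   /* device address of each copy */
    };

    /* Variable-sized inode: its own size comes first, as required above,
     * followed by as many 1 MB extents as the file needs. */
    struct qd_inode {
        uint32_t inode_bytes;       /* total size of this inode structure   */
        uint64_t file_bytes;        /* logical length of the file           */
        uint32_t nextents;          /* 1 for files <= 1 MB, more otherwise  */
        struct qd_extent extents[]; /* flexible array member                */
    };

    /* One entry in the triply replicated inode map written at checkpoints:
     * each inode's size and where each of its copies lives. */
    struct qd_imap_entry {
        uint64_t inum;
        uint32_t inode_bytes;
        uint64_t segment_addr[NREPLICAS];   /* segment holding each copy      */
        uint32_t segment_off[NREPLICAS];    /* byte offset within that segment */
    };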
4. (10 points) Allocation policy

The key observations are:
1. For big files, you want to exploit parallelism, so however the parallel writes happen, you really want to make sure that a TB file gets striped over all those parallel units.
2. For small files, we need to pack them into large segments a la LFS. Of course, if you do that, then you need to deal with finding things.

I'm going to view my disk kind of like a large RAID 0 device, in that I will treat the device as having 100 parallel devices (corresponding to the 100x parallelism computed in #1). Logically, I will think of 3-4 of those logical devices as holding mapping information, but I will use RAID 5 style parity striping to avoid hotspots for the metadata. (Although I want to triply store the data, I might want more copies of the metadata because of those pesky destructive reads.)

So, my allocation policy breaks down into:
* Triple replicate all data.
* Big files: stripe across parallel logical devices in 1 MB units.
* Small files: pack together into 1 MB segments.
* Inodes: pack them into segments; keep a map that gets written to the metadata logical devices.

5. (10 points)

Robustness: I will triple replicate data and quadruple replicate metadata. I will replicate on a per-1-MB-segment basis.

Recovery: I am going to do a largely LFS-like recovery. I take checkpoints, which involves writing out the mapping data to mapping extents (let's assume I know how to find all the mapping extents). So, I start from the last checkpoint and read through one copy of the data that follows it, rewriting each segment after I read it (the reads are destructive, so every segment I scan must be written back). Then I take a checkpoint.
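As a sketch of the roll-forward just described: the loop below scans one copy of each segment written since the last checkpoint, writes it straight back (because the reads are destructive), updates the inode map from the segment summary, and then takes a checkpoint. All of the function names and types here are hypothetical placeholders assumed only for illustration, not part of the answer above.

    #include <stdint.h>

    #define SEGMENT_BYTES (1024 * 1024)   /* minimum transfer unit: 1 MB */

    struct segment;   /* one segment: data blocks + inodes + summary */
    struct imap;      /* the replicated inode-location map           */

    /* Hypothetical placeholder operations, assumed to exist elsewhere. */
    uint64_t        first_segment_after(uint64_t checkpoint_addr);
    int             segment_exists(uint64_t addr);
    struct segment *read_segment(uint64_t addr);
    void            rewrite_segment(uint64_t addr, struct segment *seg);
    void            apply_summary(struct imap *map, struct segment *seg);
    void            write_checkpoint(struct imap *map);

    void recover(uint64_t last_checkpoint_addr, struct imap *map)
    {
        uint64_t addr = first_segment_after(last_checkpoint_addr);

        while (segment_exists(addr)) {
            struct segment *seg = read_segment(addr); /* reading destroys the copy... */
            rewrite_segment(addr, seg);               /* ...so write it straight back */

            /* The segment summary names the inodes the segment holds, which is
               enough to bring the in-memory inode map up to date. */
            apply_summary(map, seg);

            addr += SEGMENT_BYTES;
        }

        write_checkpoint(map);   /* persist the rebuilt map as a new checkpoint */
    }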
=-=-=-=-=-

V. Elastic OS (80)

==+== cs161 Paper Review Form
==-== DO NOT CHANGE LINES THAT START WITH "==+==" UNLESS DIRECTED!
==+== =====================================================================
==+== Begin Review
==+== Reviewer: YOURNAME GOES HERE
==+== Paper #1
==-== Title: Towards Elastic Operating Systems

==+== Review Readiness
==-== Enter "Ready" if the review is ready for others to see:

Ready

==+== A. Overall merit
==-== Choices: 1. Reject
==-==          2. Weak reject
==-==          3. Weak accept
==-==          4. Accept
==-==          5. Strong accept
==-== Enter the number of your choice:

4

(5 points for taking a position that seems consistent with the review.)

==+== B. Reviewer expertise
==-== Choices: 1. No familiarity
==-==          2. Some familiarity
==-==          3. Knowledgeable
==-==          4. Expert
==-== Enter the number of your choice:

2

(5 points for proper humility)

==+== C. Paper summary
# Write a 1-2 paragraph summary that explains this paper to those
# committee members who do not read the paper.

This paper presents a design for Elastic OS, a software layer that takes responsibility for providing resource elasticity transparently to the application. In particular, the framework is trying to relieve application developers of the burden of 1) writing their application in a granular fashion, 2) monitoring the infrastructure to know when to add/remove resources, 3) figuring out how to distribute the workload, and 4) dealing with shared state. The fundamental mechanism proposed is that of "elastic page tables," which extend traditional page tables so they can refer to pages on remote nodes. Then, by sending execution threads to the processors where the memory is found, we avoid many of the complications of DSM.

20 points: Accurate summary -- highlights what the contributions are, gets technical details correct.
15 points: Either vague or minor flaws in description.
10 points: A worthy effort, shows they read the paper, but not quite on target.

==+== D. Paper strengths
# List a few (3-5) bullets outlining the strengths of this paper

* Tackles a real problem -- designing applications that can scale seamlessly is challenging.
* The small experiments in section 2 do a nice job of demonstrating the overheads and feasibility of the approach before diving into technical details.
* The overall approach is elegantly simple.

5 points per strength, up to 15 points

==+== E. Paper weaknesses
# List a few (3-5) bullets outlining the weaknesses of this paper

* I would have liked a bit more technical detail about what is involved in the remote invocation.
* Even though this is a HotOS paper, since the authors do have some prototype implementation, I would have liked to see a bit more demonstration of how changing some of the thresholds (e.g., when to pull and when to push) affects behavior.
* I would have liked a few more words about how stretching is different from process migration.
* I would have liked a few more details about how you migrate page tables during a stretch.

5 points per weakness, up to 15 points

==+== F. Comments for author
# Provide a more detailed discussion of this paper -- highlighting
# things you thought were confusing, things you particularly liked
# about the paper and things you did not particularly like.

I would have liked more discussion about how you move page tables when you stretch a process to a new machine. Is the local machine represented in the page table's auxiliary structure the same way that a remote machine is, so that most entries are valid? You don't mention what a page table entry looks like for a page that is remote but has been swapped to disk (because in most cases you don't know), but when you first migrate a thread, you may, in fact, have pages on disk on the local node; how is this handled?

I think I know what you mean by "putting complexity in user-space" ... "increases security," but the way it's phrased feels wrong. I think avoiding putting complexity in the kernel largely improves security, but that's slightly different from adding complexity in user space (avoiding complexity altogether is a better strategy). What are the security implications of allowing the migration to take place at user level? It seems that somehow you're giving the management process access to page tables?

How do you decide when to shrink a process?

20 points here: 20/15/10/0, largely as above

==+== G. Comments for PC (hidden from authors)
# If there are things you wish to tell the program committee, but that
# you do not wish the authors to see, write it here.

==+== End Review