Computer Science 161 (2014)

Final Exam

24 hours between May 6, 2014 and May 10, 2014 PM

240 minutes/240 points

Instructions

The points allocated to each problem correspond to how much time we think the problem should take. Although you have 24 hours in which to take the exam, we do not recommend that you pull an all-nighter and spend 24 hours completing the exam.

All questions should be directed to me (Margo) via email. I will check mail regularly, but will not be watching constantly, so do not expect an instant response. I strongly encourage you to read through the exam and fire off any questions that immediately come to mind. When in doubt, document assumptions.

Here is my advice on how to complete this exam:

• Please send your answers to me in a PLAIN TEXT file as an attachment (in a perfect world, your text file would not have lines longer than 80 characters, but I'll reformat it if it does).
• Label the answers clearly so it is obvious which answers correspond to which questions, leave space between questions, and place your answers in order in the file.
• Send your completed exam to margo@eecs.harvard.edu. I will return the graded exam to you; I tend to engage in a dialog as much as possible, so there is some educational value in reading more than just the final number.
• This exam is open text book, open source code, and open notes.
• You may not use computers (cell phones, PDAs, or any other device that can talk to a network) during the exam, other than to record your answers, to view OS/161 or System/161 code, or to examine your notes or the course web site.
• Please do not access the Piazza group during the exam.
• You must complete the exam on your own.

Exam Policy Statement

Please do not forget to do this!

As stated above, please write the following statement and place your name after it, asserting that you have followed the rules of the exam: "I have neither given nor received help on this exam and have followed the rules as stated." If this statement is absent, you will hear from me.

1. Retake Concept Inventory (5 points)

This is a free 5 points -- all you need to do is complete the concept inventory that you took at the beginning of the year to the best of your ability. You will be graded for completion, not based upon the correctness of your answers, but if you do worse than you did in February, we'll be sad.

2. OS Abstractions (45 points)

The main learning objective of this course is for you to be able to explain how an operating system provides the abstractions with which programmers are familiar. To that end, for each of the three items listed below, write an explanation for a smart student who has completed CS50, is reasonably comfortable programming to the POSIX API, but has not taken CS61 (nor CS161) and doesn't know how the operating system provides each abstraction. (More concretely, your audience knows how to program, has heard of VM but has no idea what it means, knows that computers consist of a processor, memory, and a disk, and knows how to write programs in C that use POSIX system calls like fork, exec, open, close, read, write.)

Please limit your answer to roughly 100-200 words (I won't count, but I want more than 3 sentences, and less than a page).

  1. Processes
    A process is the abstraction used to make your program run on a computer. You can think of a process as consisting of a big box, which we call the address space. The box contains all the memory (data) that your program can touch; it also includes the program itself. In addition to your box, you get some active little thing that moves around inside the box executing instructions -- think of it as a really well trained insect -- it knows how to read an instruction and then do what the instruction says, but it can't get out of the box. When the program needs to do something that the insect isn't allowed to do (e.g., read from a disk), the insect makes a really loud noise (a system call) that tells you (the operating system) that there is something to do. You look in the box, find information describing what the insect wants you to do and where to get any additional information, and then you do it for the little insect.
  2. Virtual Memory
    Let's continue with our box analogy. Remember in Harry Potter how they could have tents that looked normal size on the outside but were ginormous on the inside? This is kind of how your box works. On the inside it looks like you have a huge amount of memory at your disposal (usually about 2 or 4 GB; sometimes a lot more). However, your machine might not have that much memory. VM is what makes that illusion work. In hardware there are mechanisms (TLBs and page tables) that, on every instruction, translate an address in your huge box to a real address in memory (there's a sketch of this translation after this list). The OS does a quick sleight of hand -- if the box needs data and the real memory is full, it picks something to take out of real memory (and leaves a note for itself), and then finds the item needed and puts it in the real memory.
  3. Files
    Computers store data on disks in blocks, which are typically chunks of about 4 KB. The operating system builds an index structure for your file that provides the place on disk where each part of your file lives. For example, if your file is 12 KB, this index structure will say something like, "block 0 is at address A, block 1 is at address B, and block 2 is at address C." Directories work kind of the same way -- the directory is stored as a file, but its contents are structured in a specific manner. You can think of a directory as containing an array of structures, where each structure contains a number that uniquely identifies a file and the name (relative to the directory) corresponding to that file (there's a sketch of this structure after this list).
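Two sketches to make the answers above concrete. Everything in them (the page size, the one-level table, the field names and name length) is a simplifying assumption for illustration, not how any particular system lays things out.

For virtual memory, here is roughly the translation the hardware performs on every memory access (real hardware uses multi-level tables and caches entries in a TLB):

	#include <stdint.h>

	#define PAGE_SIZE 4096	/* bytes per page; a common choice */

	extern uint32_t page_table[];	/* toy one-level table, indexed by virtual page number */

	uint32_t
	translate(uint32_t vaddr)
	{
		uint32_t vpn    = vaddr / PAGE_SIZE;	/* which page of the "huge box" */
		uint32_t offset = vaddr % PAGE_SIZE;	/* where within that page */
		return page_table[vpn] * PAGE_SIZE + offset;	/* the real address */
	}

For files, here is the directory entry structure described above:

	#include <stdint.h>

	struct dirent_on_disk {
		uint64_t inum;		/* uniquely identifies the file */
		char name[256];		/* the name, NUL-terminated; fixed length for simplicity */
	};
	/* A directory's contents are then just an array of these. */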

3. Pot Pourri (35 points)

  1. (15 points) We know that operating systems enforce security on files by means of access control lists and capabilities, but operating systems must be security conscious about other things as well. Think about the three A's, and list 3 things an operating system does to provide security. For each item, if possible, describe which of the A's it addresses.
    Here are five for fun.
    1. Virtual memory is a security mechanism; it protects one process from reading another's address space. A virtual address is like a capability for a memory location, so that's an authorization mechanism. The access enforcement is both in the HW and in the VM system that manages page tables and TLBs. The authentication mechanism kicks in when a process is created and given the credentials associated with a particular user.
    2. Login: Forcing users to log in lets you associate an identity with each session/process, etc. The login mechanism itself is an authentication action.
    3. Argument checking of system calls: we copy user process data into the kernel and do some amount of error checking. This is really a kind of access enforcement in that we are trying to prevent user processes from handing data to the OS that would cause it to crash or, worse yet, leak information (a code sketch of this appears after this list).
    4. Process scheduling: this is a kind of DoS avoidance mechanism. It prevents a process from taking over the whole system -- I guess that makes it a kind of access enforcement, but I don't think it fits neatly in any category.
    5. Disallowing user processes direct access to I/O devices: direct access would bypass access control, allowing users to do things for which they don't have the authority.
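For item 3, here is roughly what argument checking looks like in an OS/161-style system call. This is a sketch: the buffer size and the stub error handling are simplifications, though copyinstr itself is the real OS/161 routine.

	#include <types.h>	/* OS/161 kernel headers */
	#include <copyinout.h>

	/* copyinstr() copies a string from user space into the kernel,
	 * validating that every byte comes from the caller's own address
	 * space; a bad pointer yields EFAULT instead of a kernel crash. */
	int
	sys_open(const_userptr_t user_path, int flags)
	{
		char kpath[1024];	/* kernel-side copy of the pathname */
		size_t got;
		int err;

		err = copyinstr(user_path, kpath, sizeof(kpath), &got);
		if (err) {
			return err;	/* reject the bad argument */
		}
		/* ... proceed to open kpath with the checked flags ... */
		(void)flags;
		return 0;
	}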

  2. (20 points) Discuss how virtualization and time-sharing are similar/different. One paragraph each for the similarities and the differences should be sufficient.
    Similarities:
    • Both are about equitably sharing resources among different agents, where agent = user in a time-shared system and agent = VM in a virtualized environment.
    • Both want to provide as much isolation as possible -- performance isolation as well as protection.
    • Time sharing systems can deschedule user processes in the same way that a VMM can deschedule a virtual machine.
    Differences:

4. File System Design: Inode location management (15 points)

LFS keeps track of the locations of inodes by storing a large array in the ifile. The inode number is used as an index into this array, which contains the disk address of the corresponding inode.

In contrast, ZFS stores all dnodes directly in an object set (objset). Each dnode lives in a precise location in the objset (e.g., dnode N lives at offset sizeof(dnode) * N). So updating a dnode is just like updating any other part of an object or objset.

Both systems use a no-overwrite storage mechanism, so inodes and dnodes move around the disk. Explain in one to two paragraphs how these two approaches to tracking the locations of file metadata are similar and dissimilar. Think about how many I/Os are required to update files in each system.
These two approaches actually sound very similar to me. In LFS, we use a file, while ZFS uses an objset; an objset is simply a more general abstraction than a file. However, updates are handled similarly. In LFS, when you update an inode, you write a new copy of that inode and simply change the disk address in the ifile, which gets written to disk to be looked up later. In ZFS, when you update a dnode, that changes the address of a pointer in the dnode of the objset. If we count the I/Os, in LFS you get three writes: the new data block, the new inode, and the ifile block containing the inode's new address.

In ZFS you get two: the new data block and the new objset block containing the dnode. So, ZFS uses one fewer I/O. BUT ... because an update to a dnode is "in place" within the objset, you will write many dnodes out that didn't change (i.e., more than one dnode fits in a block). In LFS, you repack inodes together, so that if you modify inodes 5, 500, and 5000, you can put them all in the same block. In ZFS, dnodes 5, 6, and 7 will always be in the same block, while dnodes 500 and 5000 will be in different blocks. So, ZFS may end up writing more total blocks.
At the end of the day, these approaches seem quite similar to me.
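To make the block-packing point concrete, here is the arithmetic as a sketch (the 4 KB block and 512-byte dnode sizes are assumptions for illustration):

	#define BLOCKSIZE 4096	/* assumed file system block size */
	#define DNODESIZE 512	/* assumed on-disk dnode size */

	/* ZFS: dnode N's home block in the objset is fixed by arithmetic. */
	unsigned long
	objset_block(unsigned long n)
	{
		return n / (BLOCKSIZE / DNODESIZE);	/* 8 dnodes per block here */
	}

	/* So dnodes 5, 6, and 7 all live in block 0, but dnode 500 lives
	 * in block 62 and dnode 5000 in block 625: three dirty dnodes can
	 * mean three dirty blocks. LFS is free to repack the three
	 * modified inodes into a single new block instead. */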

5. File System Design: Journaling user data (60 points)

This is one of those long design questions. Read it through first, but then move on to some of the short-answer questions before diving into this one.

Today's modern file systems, like ZFS, support many features, such as:

  • snapshots
  • deduplication, where a single data block can belong to multiple files

These features result in data blocks being referenced from multiple places (e.g., two different files; two different snapshots).

Your job is to design a scheme for efficiently journaling user data in such a system. You should take advantage of the fact that blocks can be referenced from multiple locations. You should assume the following about your system (many of these ideas are taken from ZFS):

Part I (50 points)

In this part, you will explain your mechanism for journaling user data. The key design constraint that you must satisfy is that file data may only be written to persistent storage once. You should include a description of any log record(s) you wish to use. You should explain when these log records are created and when they must be written to persistent storage. You should also explain whether the addition of these log records introduces any dependencies in your buffer management. Finally, discuss the advantages and disadvantages of your approach.

We suggest that you begin approaching this problem by figuring out everything that you need to do to handle a write system call. Then consider how you will handle the following two scenarios (where BLOCKSIZE is the size of a block in the file system):

Scenario 1
	char buf[BLOCKSIZE];

	write(fd, buf, BLOCKSIZE);
Scenario 2
	char buf[BLOCKSIZE];

	write(fd, buf, BLOCKSIZE/2);
	write(fd, buf+BLOCKSIZE/2, BLOCKSIZE/2);
Next, start figuring out the log records you will need and any dependencies in your buffer cache. Write that all up, then step back and assess it critically to discuss the advantages and disadvantages.
The key insight is that we can write the data once and then reference it both from the log and from a file. This introduces the following tricky parts:
  1. we will need to write the data to disk before we write the log record. This is TOTALLY backwards from practically everything we do. The only reason it is OK is that new data is always written to new blocks, so the only bad outcome is that we write data, no log record is ever written (due to a crash), and those blocks continue to be treated as free blocks. This wastes disk bandwidth, but does not damage the file system's integrity.
  2. there are multiple steps involved in writing and logging data
  3. free space management just got a bit more complicated.
When an application writes data to a file, we need to do the following:
  1. Find a free block into which to write the new data.
  2. Log the block allocation
  3. Write the data
  4. Write a log record referencing the newly written data block
  5. Log the change in the dnode
  6. Flush the log at commit
As discussed above, we need to make sure that we flush the data (step 3) before we let the record from step 4 go to disk (otherwise we have a log entry referencing data that is not yet written). Fortunately, we don't really have to flush the log on every commit; we can batch commits together and push them to disk once we've accumulated a lot of data. In the meantime, we might be able to write blocks that were written in their entirety to disk.
So, our new log record looks something like:
	txnid = TXNID
	type = data write
	file = identify the file (dnode)
	block = block number
	daddr = block pointer referencing the newly written block
	chksum = a checksum: not strictly necessary, but if only one copy of
		the block is successfully written, perhaps we can use this to
		figure out which of the multiple copies is correct and recover
		partially written data?  That seems like a pretty advanced
		feature, so I'm not going to discuss it further, but it's a
		somewhat neat idea for a way to avoid writing all N copies of
		a block.
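In C, the record might look something like this (the field widths and the type constant are assumptions, not taken from any real system):

	#include <stdint.h>

	#define LOG_DATA_WRITE 7	/* assumed record-type constant */

	struct data_write_logrec {
		uint64_t txnid;		/* transaction id */
		uint32_t type;		/* = LOG_DATA_WRITE */
		uint64_t dnode;		/* identifies the file (dnode number) */
		uint64_t blkno;		/* logical block number within the file */
		uint64_t daddr;		/* pointer to the newly written data block */
		uint64_t chksum;	/* the optional checksum discussed above */
	};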
When we create that log record, we also make it dependent upon the data block -- we must not flush the log until that data block is safely on disk.
A huge disadvantage of this scheme is that instead of all log writes being sequential, we have these data blocks that are strewn hither and yon around the storage system. To minimize the impact, I will asynchronously write blocks when we "fill" them. I'll call a block "filled" when we write the last byte in it. So, if you write sequentially (as all good programs do), then when you finish writing a block, we'll write it asynchronously. If you write a partial block, we won't immediately write it, but will keep it cached until either A) you finish writing it or B) we have to flush it so we can write a log record.
OK, so now our write algorithm looks like the following:
  1. Allocate a free block for the new data and log the allocation.
  2. Copy the user's data into the cached block.
  3. Create the data-write log record and make it dependent on the data block.
  4. If the write filled the block, queue the block for an asynchronous write.
And our log flush algorithm looks like this:
  1. If every data block on which this chunk of the log depends is already on disk, write the chunk.
  2. Otherwise, flush those data blocks first, then write the chunk.
We can test the first condition by simply retaining in memory a list of blocks on which chunks of the log depend (by a chunk of log, I mean a decent-sized unit of log that I write).
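Here is a sketch of the flush-side dependency check in C (the structures and helper functions are invented for illustration):

	#include <stdbool.h>
	#include <stddef.h>

	struct data_block {
		bool on_disk;		/* set when the block's write completes */
	};

	struct log_chunk {
		struct data_block **deps;	/* data blocks this chunk references */
		size_t ndeps;
	};

	extern void write_block_sync(struct data_block *b);	/* assumed helper */
	extern void write_chunk(struct log_chunk *c);		/* assumed helper */

	/* Flush one chunk of the log, honoring the data-first rule. */
	void
	flush_log_chunk(struct log_chunk *c)
	{
		for (size_t i = 0; i < c->ndeps; i++) {
			if (!c->deps[i]->on_disk) {
				write_block_sync(c->deps[i]);	/* data must land first */
			}
		}
		write_chunk(c);		/* now the log chunk is safe to write */
	}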
Advantages:
  • Each data block is written to persistent storage exactly once; we never pay the double write of traditional data journaling.
Disadvantages:
  • Log writes are no longer purely sequential, since the log references data blocks scattered around the storage system.
  • Flushing the log now depends on data-block writes, adding dependencies to buffer management.
  • Free space management gets more complicated (see Part II).

Part II (10 points)

Now consider any modifications that you might need to make to the system's free space management.
Our free space management already has to handle the case that a block can be referenced from multiple places, so that's a big win. The only things we really need to figure out are:

  1. When does the log no longer need to retain its reference to a data block?
  2. Where do we store that information?
Item 1 is the big deal: the log is only required to keep track of things between checkpoints, so we know that the log can relinquish its reference to a block as soon as the next checkpoint occurs. That's great news! It means that all we need to ensure is that we don't fully reclaim any disk blocks until the checkpoint following their allocation -- that doesn't sound tough, does it?
Let's think about deallocation, because that's where this matters. So let's assume that we're doing a truncate (or a delete). There are two cases to consider:
  1. The block was allocated in a previous checkpoint: no problem, we don't need the log record any more, so we can ignore its reference.
  2. The block was allocated in the current checkpoint: the deallocation in and of itself isn't a huge problem, but if we were to reallocate it again, then we could create the following situation:

	txn 1:	Allocate and write block B
		...
	txn 2:	Free block B
	txn 3:	Allocate and write block B
		Flush the block to disk
		crash...
Let's assume that B did get written to disk in txn 3, and furthermore, let's assume that the log record for txn 2 didn't make it to disk. When we try to recover the log, our original write (from txn 1) had better still be valid. So, this suggests that we cannot reallocate blocks allocated in the current checkpoint.
That doesn't sound so hard -- we already need to keep track of newly allocated blocks until the log flushes (as mentioned above), so we can keep track of them until the end of the checkpoint and simply delay deallocations of those blocks until the next checkpoint.
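A sketch of that bookkeeping (the structures, bound, and helper are invented for illustration; a real allocator would handle overflow and concurrency):

	#include <stddef.h>
	#include <stdint.h>

	#define MAX_PENDING 1024	/* assumed bound for the sketch */

	extern void mark_free(uint64_t blkno);	/* assumed free-map helper */

	/* Blocks freed during the current checkpoint interval; unflushed
	 * log records may still reference them, so don't reuse them yet. */
	static uint64_t pending_frees[MAX_PENDING];
	static size_t npending;

	void
	free_block(uint64_t blkno)
	{
		pending_frees[npending++] = blkno;	/* defer the deallocation */
	}

	void
	on_checkpoint_complete(void)
	{
		/* The log has released its references; reuse is now safe. */
		for (size_t i = 0; i < npending; i++) {
			mark_free(pending_frees[i]);
		}
		npending = 0;
	}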

6. Designing for Byte-Addressable Persistent Memory (80 points)

For this question, you will need to read this paper from the beginning up until section 4.1 (You will want to read about PCM, but you can stop after that.) The paper presents a way to build a file system on byte-addressable persistent memory. (You are, of course, free to read the entire paper, but I wanted to keep the portion necessary for the exam within reason.) You will undoubtedly find it useful to read the questions before you start reading.
  1. In your own words, describe short-circuit shadow paging.
    Short-circuit shadow paging is a way of stopping copy-on-write propagation before it reaches the root of the file system. Basically, you CoW until the point where you can effect the change with a single atomic write (a code sketch appears after this list).
  2. Could you use short-circuit shadow paging in a more conventional block-based file system? If so, explain how you would use it. If not, explain why it's not possible.
    Yes, you can use a similar technique in a traditional file system, although it's a little fuzzy exactly how this would differ from a conventional system. The key is to figure out when you get to an atomic unit, which would typically be a sector. Most file system operations only affect a single block, although a block could be larger than a sector, I suppose. Also, multi-block writes definitely span atomic units, so you'd need to propagate up until you're either in the same indirect block or at the inode.
  3. The paper introduces the problem caused by the fact that processors are allowed to reorder when data gets flushed to main memory. They introduce epoch barriers to address this. Disks also reorder requests. Explain how the situation between disks and memory are the same or different; if they are the same, explain what mechanism we use with disk drives to ensure proper ordering.
    The problem is similar -- that's why non-journaling, non-soft-update systems end up having to perform synchronous I/O -- it's the only way to ensure that one I/O completes before another. Soft updates are another way we try to mitigate this, but at bottom, it still boils down to synchronous writes (see the fsync example after this list).
  4. Explain how BPFS is well-suited to PCM as opposed to other, more traditional block-based systems that we've studied.
    For one, BPFS uses a much smaller block size, since there is no mechanical apparatus whose performance needs to be amortized. Second, BPFS uses atomic 64-bit writes in several places; this would not be feasible in a block-based system.
  5. It is likely that some form of byte-addressable persistent memory will be available on systems within the next few years. Propose a different way you might use this technology (i.e., something other than a file system).
    Making a persistent (non-volatile) heap seems like one good idea. Making processes persistent for instant restart seems like another. Maintaining a persistent cache that you can continue to use after a failure is a third.
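Two sketches to accompany the answers above.

For answer 1, here is the short-circuit idea in code form (the tree layout and all three helper functions are invented for illustration; see the paper for BPFS's actual mechanism):

	#include <stdint.h>

	struct node {
		struct node *parent;
		uint64_t *parent_slot;	/* pointer in the parent that names this node */
	};

	/* All three helpers are assumptions for the sketch. */
	extern int fits_in_one_pointer_update(struct node *n);
	extern void *cow_parent_with_child(struct node *n, void *child);
	extern void atomic_store_u64(uint64_t *p, uint64_t v);

	/* Copy-on-write upward only until a single atomic 64-bit
	 * pointer store can commit the entire change. */
	void
	commit(struct node *n, void *newcopy)
	{
		while (!fits_in_one_pointer_update(n)) {
			newcopy = cow_parent_with_child(n, newcopy);
			n = n->parent;
		}
		atomic_store_u64(n->parent_slot, (uint64_t)(uintptr_t)newcopy);
	}

For answer 3, the disk-world analogue of an epoch barrier is simply waiting for the first write to be durable before issuing the second, e.g. at the POSIX level:

	write(fd, bufA, BLOCKSIZE);
	fsync(fd);	/* wait until A is durable before issuing B */
	write(fd, bufB, BLOCKSIZE);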