I. Scheduling (28 points)
-------------------------

(C) Shortening the scheduling quantum produces fairer scheduling.

This is only true in a minor way. In general, the length of the quantum is pretty orthogonal to fairness. However, if you have really short quanta, then you get to reschedule more frequently. Therefore, it does give you the ability to manage the allocation of time to processes at a finer grain, which could lead to fairer outcomes (however, your algorithm could choose to ignore fairness, in which case shortening the quantum would not improve fairness at all).

(A) Shortening the scheduling quantum will make a system more responsive to interactive jobs.

This is similar in spirit to the previous one -- because you have the opportunity to schedule more frequently, your system has the potential to be more responsive. Whether it is, in fact, more responsive has more to do with how processes are scheduled than with the length of the quantum.

(D) The length of the scheduling quantum dictates how long a compute-bound process runs before another process is scheduled.

This is mostly true -- the quantum dictates a minimum amount of time the process will run in most circumstances, and, assuming equal priorities and other things, it also dictates a maximum amount of time. That said, there are situations where this is not true:

1. If there aren't any other processes running, well, then you can certainly just reschedule the currently running process, so it runs longer than one quantum.
2. Even if there are other processes running, in a hybrid system with priorities it might be the case that you want to continue running the current process, because it is of higher priority than anyone else.
3. If you have a real-time system, then there may be other reasons that you want to run the process again.
4. You could just get lucky in a lottery scheduler.
5. And if you've got a scheduler that favors interactive jobs and some annoying interactive job becomes runnable, that could end up descheduling the poor CPU-bound job early, so the quantum is not the minimum length of time either.

(B) Shortening the scheduling quantum increases system overhead.

This is always true -- if you make the quantum shorter, then you have more scheduling events per unit time. Therefore, whatever cost the scheduler incurs is paid more frequently, and this means that there is more overhead incurred.
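To put a rough number on (B), here is a small back-of-the-envelope sketch. It is my own illustration, not part of the exam answer; the per-switch cost and the quantum lengths are made-up values.

    #include <stdio.h>

    /* Crude model: each quantum of useful work is followed by one
     * reschedule, so overhead = switch_cost / (quantum + switch_cost).
     * All numbers are illustrative, not measurements. */
    int main(void) {
        const double switch_cost_us = 5.0;   /* assumed cost of one reschedule */
        const double quanta_us[] = { 100000, 10000, 1000, 100 };
        for (int i = 0; i < 4; i++) {
            double overhead = switch_cost_us / (quanta_us[i] + switch_cost_us);
            printf("quantum %8.0f us -> %5.2f%% of CPU time spent scheduling\n",
                   quanta_us[i], overhead * 100.0);
        }
        return 0;
    }

Whether that overhead is acceptable depends on the workload; the point of (B) is only that it grows as the quantum shrinks.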
II. Virtualization (45 points)
------------------------------

A. (30 points) The VMM places the guest OS and the guest application in separate processes, which provides memory isolation between the two. The VMM maps itself into the top part of each process's address space, using page table protections to prevent user-mode code from tampering with VMM code. [The guest OS is relocatable, so the guest OS will be fine if it is not located in high virtual memory.]

The guest application invokes a system call by executing the standard trap instruction. The trap is handled by the VMM, which then sets up the processor's register state to emulate the trap-time state of the user-level application. The VMM then swaps the page table pointer to point to the guest OS's address space, and invokes the guest OS's trap handler. The guest OS extracts the trap frame, determines the process that generated the trap, and writes the pid value to the expected part of the trap frame on the guest OS's kernel stack. The guest OS restores the processor's register state from the trap frame, and tries to perform an rfe ("return from exception") instruction. However, the guest OS is running in unprivileged mode, so the processor traps to the VMM. The VMM then swaps the page table pointer back, and ensures that the processor still has the trap frame register state (suitably updated with the return value from getpid()). The VMM then restarts the user application.

B. (15 points) Supporting a data-movement system call like read() is tricky, since the guest OS and the guest application are in two separate address spaces. So, there are two possibilities:

1) The VMM can act as an intermediate buffer for the IO, such that the data returned by the read is copied from the guest OS to the VMM, and then from the VMM to the guest application. Since the VMM is mapped into the high virtual memory of both the guest OS and the guest application, each copy operation looks like a standard copyin/copyout.

2) Alternatively, the VMM can establish a shared memory segment between the guest OS and the guest application. This would allow the guest OS to place the read() data into a location that is directly accessible by both the guest OS and the guest application.

III. Security (40 points)
-------------------------

A. (10 points) The new scheme is less secure, because the hardware has no way to verify that the LLB itself has been blessed by Apple! So, nothing prevents an attacker from replacing an Apple-blessed LLB with one that contains a fraudulent version of Apple's public key. The attacker can then sign all of the higher-level software components using the attacker-controlled key. The LLB will still pass formal verification, since the LLB respects the constraints given by the read-only firmware.

B.1. (20 points) P's file descriptor table should be destroyed (and a fresh one created) before P_new resumes its post-exec() execution using P's resources. Destroying the file descriptor table prevents a situation in which, e.g., P opens a sensitive file and exit()s before closing the file. If P's file descriptor table were not destroyed, P_new could read sensitive data through the preexisting file descriptor.

B.2. (10 points) P_new can reuse P's pid -- the kernel already has to support PID reuse, and pid values are not used as capabilities to access sensitive information.

IV. File Systems (27 points)
----------------------------

A.1 The answer here depends on assumptions about the distribution of file sizes. Meta-data for small files will increase, because inodes consume an entire block and most of that block will be wasted. However, for large files, meta-data shrinks, because each block pointer now references twice as much data. While doubling the block size increases the maximum file size substantially (a block holds twice as many pointers, each covering twice as much data, so each level of indirection covers four times as much), it reduces meta-data size only by roughly a factor of 2.

A.2 Sequential performance should improve, because every seek returns twice as much data.

A.3 Random IO would be just a tad slower, because you're transferring a bigger block and only reading 1 byte, but transfer time is not a significant fraction of the access time.

A.4 This should have no impact on directory traversal (if you had huge directories it might speed that up a tad). It should also have no effect on rename.

B.1 None.

B.2 None.

B.3 None.

B.4 Ideally, it would speed up directory traversal, because most pathnames would hit in the cache. However, huge parts of the cache get invalidated on a rename, so there might be a performance hit there. EXTRA POINT: the directory-operations question has two effects: lookups hit in the cache, but a directory rename suddenly invalidates huge parts of your cache, and that would incur at least a small performance hit.
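To make B.4's rename effect concrete: assuming the change in part B is a cache keyed by full pathname (which is how I read the answer above), a directory rename stales every cached path that has the renamed directory as a prefix. A minimal sketch, with invented names and sizes:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical whole-pathname cache entry: maps an absolute path to an
     * inode number.  The names and sizes here are invented for illustration. */
    struct path_cache_entry {
        char     path[256];
        uint32_t ino;
        bool     valid;
    };

    #define PATH_CACHE_SLOTS 1024
    static struct path_cache_entry path_cache[PATH_CACHE_SLOTS];

    /* After rename("/a/b", "/a/c"), every cached path equal to "/a/b" or
     * beginning with "/a/b/" is stale, so all of those entries must be
     * dropped -- this is the "huge parts of the cache get invalidated"
     * effect described in B.4. */
    static void
    path_cache_invalidate_prefix(const char *oldpath)
    {
        size_t len = strlen(oldpath);
        for (int i = 0; i < PATH_CACHE_SLOTS; i++) {
            struct path_cache_entry *e = &path_cache[i];
            if (e->valid &&
                strncmp(e->path, oldpath, len) == 0 &&
                (e->path[len] == '\0' || e->path[len] == '/')) {
                e->valid = false;
            }
        }
    }

A per-component cache would avoid most of this invalidation, but that is a different design from the whole-pathname cache assumed here.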
C.1 None -- we aren't introducing any new or different meta-data, just changing how we use what we have.

C.2 This will likely get worse for files that you have modified -- that is, when you update a block, if the block was in a good, sequential position, you have to move it somewhere else, and that is likely to hurt performance.

C.3 None.

C.4 This should be a no-op for small directories. For directories that span multiple blocks, though, it could slow down lookup time, because your directory blocks will be scattered all over the disk. I don't think it has a measurable impact on rename time.

V: Virtual Memory (20 points)
-----------------------------

A MIPS TLB entry is already tagged with an ASID, and EntryHi (which we assume is still a per-core register) already contains bits to represent the ASID of the currently executing process. So, the (now global) TLB can always distinguish between translation requests for the same virtual address in two different processes. A TLB entry would only have to be invalidated when the virt-to-phys mapping for the given ASID has changed (e.g., when the OS swaps a page out to disk, or when the OS kills a process with a particular ASID and wants to reuse the ASID for another process). There would be no need for cross-core IPIs.

VI: CMADD paper
---------------

A. I'm going to try a mechanism that combines ideas from journaling and no-overwrite file systems. (This is not the only solution possible.)

If an application write is small enough to fit in a log record, I'll write two log records: one with the before image and one with the after image of the data. (Call this case 1.)

If the write does not fit in a log record, then I'm going to conceptually break up the write into operations on blocks. Once I've done that, the first and last blocks might be partial and small enough to fall into case 1; if they do, I'll handle them as I do in case 1. All remaining modified blocks fall into case 2: we write the new data to the log as a whole block. This will require a change to the jphys layer, but I'm a kernel programmer, so I can do that! (Note that I'm not going to allow arbitrarily large log records; instead I'm going to make it possible to write a single block of data to the log, where the record(s) immediately before it describe the block or blocks that follow.) The log block is the after image; the block already on disk serves as the before image. (Special case: if it's a newly allocated block, we can record that in the preceding log record to avoid writing an extra block.)

So, here are my new records. To allow updates of up to almost a full page, we'll have two separate types for pre- and post-images:

1. Data_write_pre and Data_write_post have identical contents, but are different types:

       txnid_t   txnid;     // ID of the transaction
       uint16_t  datalen;   // Number of data bytes written
       uint16_t  offset;    // Offset in block where write starts
       daddr_t   blknum;    // Disk address where this belongs

2. Block_write is the record that precedes a block of data in the log:

       txnid_t    txnid;    // ID of the transaction
       daddr_t    blknum;   // Block described by the log block
       sfs_lsn_t  lsn;      // LSN of the log block
       uint32_t   cksum;    // Checksum of newly written data
       uint32_t   flags;    // So far the only flag is whether the
                            //   block was newly allocated

Note that you can write several of these records and then emit the log block writes -- the log block writes require a new jphys API that writes to the log without adding a jphys header (jphys may want a pre-record so that it can do its own recovery as well).
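To show how the pieces fit together, here is a rough sketch of how one application write would be carved into case-1 and case-2 log activity. The emit_* helpers and the way the old data is passed in are invented for this illustration; this is not the real jphys interface.

    #include <stddef.h>
    #include <stdint.h>

    #define SFS_BLOCKSIZE 512   /* illustrative; use the real SFS block size */

    /* Hypothetical helpers -- these names are made up for the sketch. */
    void emit_data_write_pre(uint32_t blk, uint32_t off, uint32_t len, const void *olddata);
    void emit_data_write_post(uint32_t blk, uint32_t off, uint32_t len, const void *newdata);
    void emit_block_write_record(uint32_t blk, int newly_allocated);
    void emit_log_block(const void *newdata);  /* new jphys call: raw block, no jphys header */

    /*
     * Split one application write into log activity.
     * Case 1: a partial-block piece -> before-image and after-image records.
     * Case 2: a whole block         -> Block_write record plus the block itself
     *                                  (the copy already on disk is the before image).
     */
    void
    log_app_write(uint32_t firstblk, uint32_t firstoff, size_t len,
                  const uint8_t *newdata, const uint8_t *olddata)
    {
        uint32_t blk = firstblk;
        uint32_t off = firstoff;

        while (len > 0) {
            size_t chunk = SFS_BLOCKSIZE - off;
            if (chunk > len) {
                chunk = len;
            }
            if (off != 0 || chunk < SFS_BLOCKSIZE) {
                /* Case 1: partial block (the first and/or last piece). */
                emit_data_write_pre(blk, off, chunk, olddata);
                emit_data_write_post(blk, off, chunk, newdata);
            } else {
                /* Case 2: full block; no before image is logged. */
                emit_block_write_record(blk, 0 /* not newly allocated */);
                emit_log_block(newdata);
            }
            newdata += chunk;
            olddata += chunk;
            len -= chunk;
            blk += 1;
            off = 0;
        }
    }

The ordering constraints discussed next are what make the case-2 path safe.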
Ordering constraints:

Case 1 requires only the normal WAL constraints.

Case 2 is a bit trickier: you have to make sure that the pre-image of the block is already on disk. So, before writing the log record, you need to make sure that the block in the buffer cache is clean (i.e., the pre-image is on disk); doing this first avoids any possible cycles in the order of writes that need to happen. Additionally, you may not evict a case-2 block from the buffer cache until its write transaction has committed, because the on-disk copy is serving as the before image. Therefore, you want to place the commit LSN on the data buffer, not the LSN of the write record. (This isn't a huge deal, since our transaction boundary is a single write call.) The one place this will get into trouble is if you're doing a write that is larger than the buffer cache. I'm going to disallow this for now, although you could imagine making it work by doing a no-overwrite update to the file if forced to evict before commit.

UNDO/REDO:

Case 1 is trivial: we have both before and after images, so we can blindly undo or redo as established by the abort/commit status of the encompassing transaction. (This assumes that recovery uses an undo-first strategy -- that is, do a backward pass first, undoing any aborted transactions, and then redo forward.)

Case 2 is quite similar -- you need not do anything for UNDO, as the pre-image is on disk; for REDO, you just apply the logged block to the disk block. (Note that if there is a later UNDO, the state to which you're undoing is the result of some prior write, so recovery still restores the blocks to their correct state.)
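Here is what that undo-first recovery pass could look like for these records. The record layout and every helper function are invented for this sketch (in the spirit of the earlier one); it is not the real SFS recovery code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical flattened view of the new record types. */
    enum rec_type { DATA_WRITE_PRE, DATA_WRITE_POST, BLOCK_WRITE };

    struct logrec {
        enum rec_type type;
        uint32_t      txnid;
        uint32_t      blknum;
        uint32_t      offset;    /* case-1 records only */
        uint32_t      datalen;   /* case-1 records only */
        const void   *data;      /* case-1 image, or the case-2 log block */
    };

    /* Assumed helpers: log iteration, commit status from a prior analysis
     * pass, and block IO.  None of these are real SFS/jphys calls. */
    bool log_prev(struct logrec *r);   /* newest -> oldest; false at end */
    bool log_next(struct logrec *r);   /* oldest -> newest; false at end */
    bool txn_committed(uint32_t txnid);
    void write_bytes(uint32_t blk, uint32_t off, uint32_t len, const void *p);
    void write_block(uint32_t blk, const void *p);

    void
    recover(void)
    {
        struct logrec r;

        /* Backward pass: undo aborted transactions by restoring before
         * images.  Case-2 records need no undo -- the on-disk block IS
         * the before image. */
        while (log_prev(&r)) {
            if (!txn_committed(r.txnid) && r.type == DATA_WRITE_PRE) {
                write_bytes(r.blknum, r.offset, r.datalen, r.data);
            }
        }

        /* Forward pass: redo committed transactions. */
        while (log_next(&r)) {
            if (!txn_committed(r.txnid)) {
                continue;
            }
            if (r.type == DATA_WRITE_POST) {
                write_bytes(r.blknum, r.offset, r.datalen, r.data);
            } else if (r.type == BLOCK_WRITE) {
                write_block(r.blknum, r.data);   /* the log block is the after image */
            }
        }
    }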
B) Conceptually, a workload where file integrity can be encapsulated in a single write would be handled similarly well by user-data journaling and by CMADD. So, consider an application-level log -- all writes happen at the end of the file, but they are arbitrarily long. The user-data-journaled SFS will make sure that entire application log writes appear in the file, as will CMADD.

C) Let's imagine that I'm implementing a B-tree and a node is a page. When I do a split, there are three different pages that need to be written to ensure consistency. There is no guarantee they are contiguous, so I'd need to do three separate writes, and the atomic single-write guarantee of SFS would not suffice.

D) The fundamental mismatch is that the SFS journal approach is based on fairly short transactions whose begin and end times are well identified (when you enter/exit a system call). If you cast CMADD into a transaction framework, transactions are started implicitly on open, committed on an msync/fsync, and a new transaction is immediately started. Using a pure journaling approach means that you have long-running transactions. This could greatly complicate checkpointing and potentially create scenarios where you run out of log space.

E) A journal-based approach is either going to have to commit both before- and after-images to the log, or it would have to buffer all updates in memory. Given the implicit nature of the transactions, buffering seems impractical, so that means you'll have to do before- and after-image logging for ALL files opened with O_ATOMIC. So, the journaling will not consume any additional space in the file system itself, but the log is going to be huge, huge, huge -- potentially containing all data modified between the time a file is opened and the time it is closed.

In the paper system, the journal isn't particularly large (the only new record is the one that records the multiple files participating in a syncv), but there is duplication of data and inodes on the disk. If you make multiple updates to the same block, the paper system will be much more efficient, because repeated writes go to a single COW block, while the journaling system will log before and after images of every update (you could imagine optimizing those out, but it would be tricky). If you never write the same block twice, the CMADD system is still more efficient: although you do have a copy of the inode, you have only one copy of the original block, whereas in the journaling system you end up with the original data plus both before and after images in the log. (If inodes are the same size as blocks, this ends up roughly a wash, although the paper system will duplicate indirect blocks as well.)

F) I think recovery will be significantly easier in the CMADD system. The only log records you have to process are the ones written for syncv. If those transactions committed, you have to explicitly make sure that their clones become the real copies, because all remaining clones are cleaned up at open time by throwing them away.
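For comparison with the journaling recovery sketch above, the recovery step described in F might look roughly like this. All names here are invented, and this is only my reading of the answer above, not the paper's actual mechanism.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical view of a syncv commit record: the files whose clones
     * should become the real copies. */
    struct syncv_rec {
        uint32_t  txnid;
        uint32_t  nfiles;
        uint32_t *inodes;
    };

    /* Assumed helpers; not real CMADD or SFS calls. */
    bool next_syncv_record(struct syncv_rec *r);  /* scan the journal */
    bool syncv_committed(uint32_t txnid);
    void promote_clone(uint32_t ino);             /* clone becomes the real copy */

    /* Recovery only has to walk the syncv records: committed ones get their
     * clones promoted; every other clone is simply thrown away the next time
     * the file is opened. */
    void
    cmadd_recover(void)
    {
        struct syncv_rec r;

        while (next_syncv_record(&r)) {
            if (!syncv_committed(r.txnid)) {
                continue;   /* uncommitted: clones discarded lazily at open */
            }
            for (uint32_t i = 0; i < r.nfiles; i++) {
                promote_clone(r.inodes[i]);
            }
        }
    }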