I. Scheduling (28 points)
-------------------------

(C) Shortening the scheduling quantum produces fairer scheduling.

This is only true in a minor way. In general, the length of the quantum is pretty orthogonal to fairness. However, if you have really short quanta, then you get to reschedule more frequently. Therefore, it does give you the ability to manage the allocation of time to processes at a finer grain, which could lead to fairer outcomes (however, your algorithm could choose to ignore fairness, in which case shortening the quantum would not improve fairness at all).

(A) Shortening the scheduling quantum will make a system more responsive to interactive jobs.

This is similar in spirit to the previous one -- because you have the opportunity to schedule more frequently, your system has the potential to be more responsive. Whether it is, in fact, more responsive has more to do with how processes are scheduled than with the length of the quantum.

(D) The length of the scheduling quantum dictates how long a compute-bound process runs before another process is scheduled.

This is mostly true -- the quantum dictates a minimum amount of time the process will run in most circumstances, and, assuming equal priorities and other things, it also dictates a maximum amount of time. That said, there are situations where this is not true:

1. If there aren't any other processes running, well, then you can certainly just reschedule the currently running process, so it runs longer than one quantum.
2. Even if there are other processes running, in a hybrid system with priorities it might be the case that you want to continue running the current process, because it is of higher priority than anyone else.
3. If you have a real-time system, then there may be other reasons that you want to run the process again.
4. You could just get lucky in a lottery scheduler.
5. And if you've got a scheduler that favors interactive jobs and some annoying interactive job becomes runnable, that could end up descheduling the poor CPU-bound job early, so the quantum is not the minimum length of time either.

(B) Shortening the scheduling quantum increases system overhead.

This is always true -- if you make the quantum shorter, then you have more scheduling events per unit time. Therefore, whatever cost the scheduler incurs is paid more frequently, and this means that there is more overhead incurred.
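To put a rough number on (B), here is a small back-of-the-envelope sketch. It is my own illustration, not part of the exam answer; the per-switch cost and the quantum lengths are made-up values.

    #include <stdio.h>

    /* Crude model: each quantum of useful work is followed by one
     * reschedule, so overhead = switch_cost / (quantum + switch_cost).
     * All numbers are illustrative, not measurements. */
    int main(void) {
        const double switch_cost_us = 5.0;   /* assumed cost of one reschedule */
        const double quanta_us[] = { 100000, 10000, 1000, 100 };
        for (int i = 0; i < 4; i++) {
            double overhead = switch_cost_us / (quanta_us[i] + switch_cost_us);
            printf("quantum %8.0f us -> %5.2f%% of CPU time spent scheduling\n",
                   quanta_us[i], overhead * 100.0);
        }
        return 0;
    }

Whether that overhead is acceptable depends on the workload; the point of (B) is only that it grows as the quantum shrinks.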
II. Virtualization (45 points)
------------------------------

A. (30 points) The VMM places the guest OS and the guest application in separate processes, which provides memory isolation between the two. The VMM maps itself into the top part of each process's address space, using page table protections to prevent user-mode code from tampering with VMM code. [The guest OS is relocatable, so the guest OS will be fine if it is not located in high virtual memory.]

The guest application invokes a system call by executing the standard trap instruction. The trap is handled by the VMM, which then sets up the processor's register state to emulate the trap-time state of the user-level application. The VMM then swaps the page table pointer to point to the guest OS's address space, and invokes the guest OS's trap handler. The guest OS extracts the trap frame, determines the process that generated the trap, and writes the pid value to the expected part of the trap frame on the guest OS's kernel stack. The guest OS restores the processor's register state from the trap frame, and tries to perform an rfe ("return from exception") instruction. However, the guest OS is running in unprivileged mode, so the processor traps to the VMM. The VMM then swaps the page table pointer back, and ensures that the processor still has the trap frame register state (suitably updated with the return value from getpid()). The VMM then restarts the user application.

B. (15 points) Supporting a data-movement system call like read() is tricky, since the guest OS and the guest application are in two separate address spaces. So, there are two possibilities:

1) The VMM can act as an intermediate buffer for the IO, such that the data returned by the read is copied from the guest OS to the VMM, and then from the VMM to the guest application. Since the VMM is mapped into the high virtual memory of both the guest OS and the guest application, each copy operation looks like a standard copyin/copyout.

2) Alternatively, the VMM can establish a shared memory segment between the guest OS and the guest application. This would allow the guest OS to place the read() data into a location that is directly accessible by both the guest OS and the guest application.

III. Security (40 points)
-------------------------

A. (10 points) The new scheme is less secure, because the hardware has no way to verify that the LLB itself has been blessed by Apple! So, nothing prevents an attacker from replacing an Apple-blessed LLB with one that contains a fraudulent version of Apple's public key. The attacker can then sign all of the higher-level software components using the attacker-controlled key. The LLB will still pass formal verification, since the LLB respects the constraints given by the read-only firmware.

B.1. (20 points) P's file descriptor table should be destroyed (and a fresh one created) before P_new resumes its post-exec() execution using P's resources. Destroying the file descriptor table prevents a situation in which, e.g., P opens a sensitive file and exit()s before closing the file. If P's file descriptor table were not destroyed, P_new could read sensitive data through the preexisting file descriptor.

B.2. (10 points) P_new can reuse P's pid -- the kernel already has to support PID reuse, and pid values are not used as capabilities to access sensitive information.

IV. File Systems (27 points)
----------------------------

A.1 The answer here depends on assumptions about the distribution of file sizes. Meta-data for small files will increase, because inodes consume an entire block and most of that block will be wasted. However, for large files, meta-data shrinks, because each block pointer now references twice as much data. While doubling the block size increases the maximum file size substantially (a block holds twice as many pointers, each covering twice as much data, so each level of indirection covers four times as much), it reduces meta-data size only by roughly a factor of 2.

A.2 Sequential performance should improve, because every seek returns twice as much data.

A.3 Random IO would be just a tad slower, because you're transferring a bigger block and only reading 1 byte, but transfer time is not a significant fraction of the access time.

A.4 This should have no impact on directory traversal (if you had huge directories it might speed that up a tad). It should also have no effect on rename.

B.1 None.

B.2 None.

B.3 None.

B.4 Ideally, it would speed up directory traversal, because most pathnames would hit in the cache. However, huge parts of the cache get invalidated on a rename, so there might be a performance hit there. EXTRA POINT: the directory-operations question has two effects: lookups hit in the cache, but a directory rename suddenly invalidates huge parts of your cache, and that would incur at least a small performance hit.
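To make B.4's rename effect concrete: assuming the change in part B is a cache keyed by full pathname (which is how I read the answer above), a directory rename stales every cached path that has the renamed directory as a prefix. A minimal sketch, with invented names and sizes:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical whole-pathname cache entry: maps an absolute path to an
     * inode number.  The names and sizes here are invented for illustration. */
    struct path_cache_entry {
        char     path[256];
        uint32_t ino;
        bool     valid;
    };

    #define PATH_CACHE_SLOTS 1024
    static struct path_cache_entry path_cache[PATH_CACHE_SLOTS];

    /* After rename("/a/b", "/a/c"), every cached path equal to "/a/b" or
     * beginning with "/a/b/" is stale, so all of those entries must be
     * dropped -- this is the "huge parts of the cache get invalidated"
     * effect described in B.4. */
    static void
    path_cache_invalidate_prefix(const char *oldpath)
    {
        size_t len = strlen(oldpath);
        for (int i = 0; i < PATH_CACHE_SLOTS; i++) {
            struct path_cache_entry *e = &path_cache[i];
            if (e->valid &&
                strncmp(e->path, oldpath, len) == 0 &&
                (e->path[len] == '\0' || e->path[len] == '/')) {
                e->valid = false;
            }
        }
    }

A per-component cache would avoid most of this invalidation, but that is a different design from the whole-pathname cache assumed here.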
C.1 None -- we aren't introducing any new or different meta-data, just changing how we use what we have.

C.2 This will likely get worse for files that you have modified -- that is, when you update a block, if the block was in a good, sequential position, you have to move it somewhere else, and that is likely to hurt performance.

C.3 None.

C.4 This should be a no-op for small directories. For directories that span multiple blocks, though, it could slow down lookup time, because your directory blocks will be scattered all over the disk. I don't think it has a measurable impact on rename time.

V: Virtual Memory (20 points)
-----------------------------

A MIPS TLB entry is already tagged with an ASID, and EntryHi (which we assume is still a per-core register) already contains bits to represent the ASID of the currently executing process. So, the (now global) TLB can always distinguish between translation requests for the same virtual address in two different processes. A TLB entry would only have to be invalidated when the virt-to-phys mapping for the given ASID has changed (e.g., when the OS swaps a page out to disk, or when the OS kills a process with a particular ASID and wants to reuse the ASID for another process). There would be no need for cross-core IPIs.

VI: CMADD paper
---------------

A. I'm going to try a mechanism that combines ideas from journaling and no-overwrite file systems. (This is not the only solution possible.)

If an application write is small enough to fit in a log record, I'll write two log records: one with the before image and one with the after image of the data. (Call this case 1.)

If the write does not fit in a log record, then I'm going to conceptually break up the write into operations on blocks. Once I've done that, the first and last blocks might be partial and small enough to fall into case 1; if they do, I'll handle them as I do in case 1. All remaining modified blocks fall into case 2: we write the new data to the log as a whole block. This will require a change to the jphys layer, but I'm a kernel programmer, so I can do that! (Note that I'm not going to allow arbitrarily large log records; instead I'm going to make it possible to write a single block of data to the log, where the record(s) immediately before it describe the block or blocks that follow.) The log block is the after image; the block already on disk serves as the before image. (Special case: if it's a newly allocated block, we can record that in the preceding log record to avoid writing an extra block.)

So, here are my new records. To allow updates of up to almost a full page, we'll have two separate types for pre- and post-images:

1. Data_write_pre and Data_write_post have identical contents, but are different types:

       txnid_t   txnid;     // ID of the transaction
       uint16_t  datalen;   // Number of data bytes written
       uint16_t  offset;    // Offset in block where write starts
       daddr_t   blknum;    // Disk address where this belongs

2. Block_write is the record that precedes a block of data in the log:

       txnid_t    txnid;    // ID of the transaction
       daddr_t    blknum;   // Block described by the log block
       sfs_lsn_t  lsn;      // LSN of the log block
       uint32_t   cksum;    // Checksum of newly written data
       uint32_t   flags;    // So far the only flag is whether the
                            //   block was newly allocated

Note that you can write several of these records and then emit the log block writes -- the log block writes require a new jphys API that writes to the log without adding a jphys header (jphys may want a pre-record so that it can do its own recovery as well).
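To show how the pieces fit together, here is a rough sketch of how one application write would be carved into case-1 and case-2 log activity. The emit_* helpers and the way the old data is passed in are invented for this illustration; this is not the real jphys interface.

    #include <stddef.h>
    #include <stdint.h>

    #define SFS_BLOCKSIZE 512   /* illustrative; use the real SFS block size */

    /* Hypothetical helpers -- these names are made up for the sketch. */
    void emit_data_write_pre(uint32_t blk, uint32_t off, uint32_t len, const void *olddata);
    void emit_data_write_post(uint32_t blk, uint32_t off, uint32_t len, const void *newdata);
    void emit_block_write_record(uint32_t blk, int newly_allocated);
    void emit_log_block(const void *newdata);  /* new jphys call: raw block, no jphys header */

    /*
     * Split one application write into log activity.
     * Case 1: a partial-block piece -> before-image and after-image records.
     * Case 2: a whole block         -> Block_write record plus the block itself
     *                                  (the copy already on disk is the before image).
     */
    void
    log_app_write(uint32_t firstblk, uint32_t firstoff, size_t len,
                  const uint8_t *newdata, const uint8_t *olddata)
    {
        uint32_t blk = firstblk;
        uint32_t off = firstoff;

        while (len > 0) {
            size_t chunk = SFS_BLOCKSIZE - off;
            if (chunk > len) {
                chunk = len;
            }
            if (off != 0 || chunk < SFS_BLOCKSIZE) {
                /* Case 1: partial block (the first and/or last piece). */
                emit_data_write_pre(blk, off, chunk, olddata);
                emit_data_write_post(blk, off, chunk, newdata);
            } else {
                /* Case 2: full block; no before image is logged. */
                emit_block_write_record(blk, 0 /* not newly allocated */);
                emit_log_block(newdata);
            }
            newdata += chunk;
            olddata += chunk;
            len -= chunk;
            blk += 1;
            off = 0;
        }
    }

The ordering constraints discussed next are what make the case-2 path safe.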
Ordering constraints:

Case 1 requires only the normal WAL constraints.

Case 2 is a bit trickier: you have to make sure that the pre-image of the block is already on disk. So, before writing the log record, you need to make sure that the block in the buffer cache is clean (i.e., the pre-image is on disk); doing this first avoids any possible cycles in the order of writes that need to happen. Additionally, you may not evict a case-2 block from the buffer cache until its write transaction has committed, because the on-disk copy is serving as the before image. Therefore, you want to place the commit LSN on the data buffer, not the LSN of the write record. (This isn't a huge deal, since our transaction boundary is a single write call.) The one place this will get into trouble is if you're doing a write that is larger than the buffer cache. I'm going to disallow this for now, although you could imagine making it work by doing a no-overwrite update to the file if forced to evict before commit.

UNDO/REDO:

Case 1 is trivial: we have both before and after images, so we can blindly undo or redo as established by the abort/commit status of the encompassing transaction. (This assumes that recovery uses an undo-first strategy -- that is, do a backward pass first, undoing any aborted transactions, and then redo forward.)

Case 2 is quite similar -- you need not do anything for UNDO, as the pre-image is on disk; for REDO, you just apply the logged block to the disk block. (Note that if there is a later UNDO, the state to which you're undoing is the result of some prior write, so recovery still restores the blocks to their correct state.)
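Here is what that undo-first recovery pass could look like for these records. The record layout and every helper function are invented for this sketch (in the spirit of the earlier one); it is not the real SFS recovery code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical flattened view of the new record types. */
    enum rec_type { DATA_WRITE_PRE, DATA_WRITE_POST, BLOCK_WRITE };

    struct logrec {
        enum rec_type type;
        uint32_t      txnid;
        uint32_t      blknum;
        uint32_t      offset;    /* case-1 records only */
        uint32_t      datalen;   /* case-1 records only */
        const void   *data;      /* case-1 image, or the case-2 log block */
    };

    /* Assumed helpers: log iteration, commit status from a prior analysis
     * pass, and block IO.  None of these are real SFS/jphys calls. */
    bool log_prev(struct logrec *r);   /* newest -> oldest; false at end */
    bool log_next(struct logrec *r);   /* oldest -> newest; false at end */
    bool txn_committed(uint32_t txnid);
    void write_bytes(uint32_t blk, uint32_t off, uint32_t len, const void *p);
    void write_block(uint32_t blk, const void *p);

    void
    recover(void)
    {
        struct logrec r;

        /* Backward pass: undo aborted transactions by restoring before
         * images.  Case-2 records need no undo -- the on-disk block IS
         * the before image. */
        while (log_prev(&r)) {
            if (!txn_committed(r.txnid) && r.type == DATA_WRITE_PRE) {
                write_bytes(r.blknum, r.offset, r.datalen, r.data);
            }
        }

        /* Forward pass: redo committed transactions. */
        while (log_next(&r)) {
            if (!txn_committed(r.txnid)) {
                continue;
            }
            if (r.type == DATA_WRITE_POST) {
                write_bytes(r.blknum, r.offset, r.datalen, r.data);
            } else if (r.type == BLOCK_WRITE) {
                write_block(r.blknum, r.data);   /* the log block is the after image */
            }
        }
    }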
B) Conceptually, a workload where file integrity can be encapsulated in a single write would be handled similarly well by user-data journaling and by CMADD. So, consider an application-level log -- all writes happen at the end of the file, but they are arbitrarily long. The user-data-journaled SFS will make sure that entire application log writes appear in the file, as will CMADD.

C) Let's imagine that I'm implementing a B-tree and a node is a page. When I do a split, there are three different pages that need to be written to ensure consistency. There is no guarantee they are contiguous, so I'd need to do three separate writes, and the atomic single-write guarantee of SFS would not suffice.

D) The fundamental mismatch is that the SFS journal approach is based on fairly short transactions whose begin and end times are well identified (when you enter/exit a system call). If you cast CMADD into a transaction framework, transactions are started implicitly on open, committed on an msync/fsync, and a new transaction is immediately started. Using a pure journaling approach means that you have long-running transactions. This could greatly complicate checkpointing and potentially create scenarios where you run out of log space.

E) A journal-based approach is either going to have to commit both before- and after-images to the log, or it would have to buffer all updates in memory. Given the implicit nature of the transactions, buffering seems impractical, so that means you'll have to do before- and after-image logging for ALL files opened with O_ATOMIC. So, the journaling will not consume any additional space in the file system itself, but the log is going to be huge, huge, huge -- potentially containing all data modified between the time a file is opened and the time it is closed.

In the paper system, the journal isn't particularly large (the only new record is the one that records the multiple files participating in a syncv), but there is duplication of data and inodes on the disk. If you make multiple updates to the same block, the paper system will be much more efficient, because repeated writes go to a single COW block, while the journaling system will log before and after images of every update (you could imagine optimizing those out, but it would be tricky). If you never write the same block twice, the CMADD system is still more efficient: although you do have a copy of the inode, you have only one copy of the original block, whereas in the journaling system you end up with the original data plus both before and after images in the log. (If inodes are the same size as blocks, this ends up roughly a wash, although the paper system will duplicate indirect blocks as well.)

F) I think recovery will be significantly easier in the CMADD system. The only log records you have to process are the ones written for syncv. If those transactions committed, you have to explicitly make sure that their clones become the real copies, because all remaining clones are cleaned up at open time by throwing them away.
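For comparison with the journaling recovery sketch above, the recovery step described in F might look roughly like this. All names here are invented, and this is only my reading of the answer above, not the paper's actual mechanism.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical view of a syncv commit record: the files whose clones
     * should become the real copies. */
    struct syncv_rec {
        uint32_t  txnid;
        uint32_t  nfiles;
        uint32_t *inodes;
    };

    /* Assumed helpers; not real CMADD or SFS calls. */
    bool next_syncv_record(struct syncv_rec *r);  /* scan the journal */
    bool syncv_committed(uint32_t txnid);
    void promote_clone(uint32_t ino);             /* clone becomes the real copy */

    /* Recovery only has to walk the syncv records: committed ones get their
     * clones promoted; every other clone is simply thrown away the next time
     * the file is opened. */
    void
    cmadd_recover(void)
    {
        struct syncv_rec r;

        while (next_syncv_record(&r)) {
            if (!syncv_committed(r.txnid)) {
                continue;   /* uncommitted: clones discarded lazily at open */
            }
            for (uint32_t i = 0; i < r.nfiles; i++) {
                promote_clone(r.inodes[i]);
            }
        }
    }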