1. What is safety? Liveness? Which is easier to prove, and why? Safety -- a relationship between an implementation and a specification that shows that the implementation can never get into a state not in the specification. Liveness -- a relationship between an implementation and a specification that shows that the implementation moves toward an "end" state in the specification. Safety is generally easier to prove, because it can usually be shown by using invariants. Liveness proofs generally require reasoning about infinite sequences of actions, and so are intrinsically harder than safety proofs, which only require reasoning about finite runs. In addition, liveness is often comparable in difficulty to the halting problem since liveness properties often entail termination. --- 2. Define synchronous and asynchronous systems. Why is most formal work done on asynchronous systems? asynchronous -- the system does not make _any_ assumptions about time. synchronous -- the system makes _some_ assumptions about time. In particular, these may include bounded message delays, bounded computation times, a globally synchronized clock, etc. Most work is done on asynchronous systems because they are more general and more realistic -- any (positive) result shown for asynchronous systems is also true for synchronous ones, but not vice versa. --- 3. What is a stable predicate? What is the relationship between a possible state and a definite state? A stable predicate is one that, once true, remains true. A possible state is one that could have been entered on a valid run of the system (one consistent with the observations). A definite state is one that is true on all possible valid runs of the system. Definite implies possible, but it can also be the case that definite(X) and possible(!X), because !X could have been possible at a different point in the execution. --- 4. List six types of benign failure, from weakest to strongest. Failstop Crash Send, receive ommission (about equal) General ommission Timing failure (sometimes called performance failure) --- 5. According to Waldo, et. al., what are the four characteristics of a distributed system that make it impossible to write distributed systems in the same way that local systems are written? Non-uniform memory model, partial failure, greater latency, and inherent concurrency. --- 6. What three properties define reliable broadcast? Agreement -- if a correct process delivers M, all correct processes eventually deliver M Validity -- If a correct process broadcasts M, it eventually delivers M. Integrity -- For any message M, every correct process delivers M at most once, and only if some process broadcast M. --- 7. Why is atomic broadcast impossible in a deterministic asynchronous system? Atomic broadcast requires that all correct processes see all the broadcasts in exactly the same order. This implies a total order on all of the broadcasts, which is equivalent to globally synchronized clocks. --- 8. Discuss the type truncation problem in RPC based systems. Describe how this problem is solved in CORBA, Modula-3 Network Objects, and Java RMI, and point out some of the consequences of these solutions. The type truncation problem arises when using RPC systems with object oriented languages that support inheritance and polymorphism. When making an RPC call to an interface with some declared parameter or return type, the system has to determine what to do if a subtype of the declared type is passed or returned. Unlike single-address-space systems, there is no guarantee in a distributed system that the process receiving the subtype will have the code associated with that subtype. CORBA solves this problem by always treating the object as exactly the type declared in the interface. This is problematic because it more or less makes inheritance useless for use in RPC, but a better solution is hard to imagine in a language independent system. Modula-3 uses the best match--the class lowest in the class hierarchy that's present on both sides of the call. This seems better than CORBA at first glance, but leads to even more problems, because calls become non-deterministic from the point of view of the caller. Java RMI takes advantage of the fact that Java is bytecode compiled, and so arranges to download the code for the exact class being passed. This introduces other issues, but is probably the best solution from the programming point of view. --- 9. Explain the Jini notion of a lease. How are leases used in the Jini system? How do leases interact with the transaction system in Jini? A Jini lease is, at it's most abstract, a time-based convention to determine the actions of a pair of distributed objects on some action to mutually take if there is a failure that makes communication impossible sometime in the future. In most of its uses, it is a mechanism by which the grantor of the lease will do its best to keep a resource available to the lease holder for a certain amount of time. The leaser can relinquish the lease at any point before it expires, and can attempt to renew the lease at any time. After the lease expires, the lessee can free up or reuse any resources that were held by the lease. This feature allows Jini systems to deal with various partial failures, and so leases are used extensively to control access to many resources. Transactions are also leased, so that if some of the nodes in the transaction crash or get disconnected while it's going on, the transaction is aborted when the lease expires. However, there is a complication: to preserve transactional semantics, the leases are ignored once the transaction enters the voting phase: once the transaction manager has started collecting votes, it must wait until it hears from all the participating processes and then inform them whether to commit or abort. Participants in the transaction, once they have voted to roll forward, must also wait until they are told the outcome of the vote by the transaction manager. This can take a long time if machines are down, but is unavoidable if one really wants true transactions. 10. Describe the relationship between a vector clock and a distributed system history. Explain how the requirements for consistency based on vector clocks can be re-stated in terms of a consistent cut on a history. A local history, hi, of a process pi is the set of all of the events ei that occur in that process, along with the total ordering of those events. The global history of a system made up of processes p1...pn is the union of the local histories h1...hn, that is, it is all of the events that occurred in any of the local histories along with the partial ordering induced by those local histories. A vector clock for any process that is part of distributed system made up of n processes is a vector of size n. Each entry in the vector is a timestamp for some process in the distributed system. Every vector clock is initialized to have all 0 values, and after that the value of the clock at process i is incremented by one in the ith place for every internal event, and for every receive event the value of the clock at each index other than i is the maximum of the value of the current value and the value on the incoming time stamp (this assumes that all messages are time stamped) while the value at the ith place is again incremented by one. As a result, the value of each entry in a vector clock can be associated with an event at the process that corresponds to that position in the vector. Such an identification of events, one per process, can be used to identify a frontier of a history, which is the set of events in each process up to and including the frontier event. This also identifies a cut in the history of the computation as a whole. Any vector clock will identify a cut that is consistent, that is, there will be no receive event in the cut that is not matched by a previous send event that is also in the cut. But any receive event will increase the value of the vector clock in the receiving process to include the index at the time of send (which includes the send event) from the sending process for the index corresponding to the sending process. So the act of updating the clock at the receive will insure that the vector clock of the receiving event includes the prior send. Similarly, given a cut identified by the frontier events e1...en, we can take the vector clocks associated with each of the events and determine if the cut is consistent. We can do this by insuring that For all i,j : 1<=i <= n, 1 <=j <=n : VC(ei)[i] >= VC(ej)[i] That is, for all of the events, the vector clock value for the event in its own process is greater than or equal to the vector clock value for that process in any of the other event's vector clocks. If this is the case, it insures that no process includes a receive from another process that doesn't include the send event in that second process. 11. What is the gap detection problem in logical clocks? Explain why logical clocks have this problem, and then show that vector clocks do not in a particular special case. Why is this special case important? A logical clock for a process is a single integer value that provides a timestamp for the events in the process. If the event is an internal or send event, the logical clock is incremented by 1; if the event is a receive event, the logical clock is set to the maximum of the current value or the value of the timestamp on the event being received, and then the clock is incremented by one. Logical clocks can be used to order events, but are subject to the gap detection problem. This problem is that, given two two events e1 and e2 with logical clock value c1 < c2 -1, there is no way to determine if there is a third event e3 such that e1 < e3 < e2. This is because logical clock values can jump over some integer values, and there is no way to differentiate between a clock that has jumped a value and a clock that is missing a value because of some event that has that value. Vector clocks have a form of weak gap detection, that is, if we have event e1 of pi and event e2 of pj then if VC(e1)[k] < VC(e2)[k] for some k != j then there is some event e3 such that ~(e3 -> e1) and (ek -> e2). This tells us only that there is some event e3 that is prior to e2 and not after e1. But if k = i (the special case) knowing that e3 is not after e1 allows us to conclude that e3 is before e1, since in a single process all events are totally ordered. This case is important in that it allows us to detect gaps in the event stream from a single process. 12. Given a proof that the algorithm given in Fig. 5 of Hadzilacos and Toueg transforms a reliable broadcast algorithm into a FIFO reliable broadcast. To prove that the algorithm in Figure 5 provides FIFO reliable broadcast, we only need to show that it provides FIFO properties, since it is already built on top of a reliable broadcast mechanism. FIFO broadcast requires that we insure FIFO order, defined as If a correct process broadcasts a message m before it broadcasts a message m' then no correct process delivers m' unless it has previously delivered m. Suppose that a correct process has broadcast m before it broadcasts m'. Part of the algorithm requires that the broadcasting process send a monotonically increasing sequence number with all messages; therefore the sequence number of m will be smaller than the sequence number of m'. The correct receiving processes keep a queue for each sending process and a next number, indicating the sequence number of the next message from that process to be delivered. The receiving process will only deliver a message when it is the next message (by sequence number) to be delivered, and only increments the next value when it delivers a message with a sequence number equal to the current value of that number. So a correct process will only deliver m' when the sequence number of m' is equal to the internal variable next. But for next to reach that number, it had to have delivered messages from the sending process with all previous sequence numbers. Since m is such a message, the correct process must have delivered m prior to delivering m', which was to be proved.