I want to construct an abstract machine that is somewhat faithful
to the time and space requirements of a more realistic functional
language implementation. The goal here is to examine *intensional*
properties of programs (such as time, space, etc.) and not just
extensional aspects (e.g., input/output).
Let's start with the following expression language:
e ::= var(i) | i | p | () | (e1,e2) | #i e | \e | e1 e2
Expressions include tuples, lambda-expressions, etc. However, instead
of using variables, we're going to use de Bruijn notation. Recall
that var(i) refers to the ith-nearest enclosing lambda (counting from
0). The role of
pointers will become clearer below where we see that we'll be using an
*allocation*-based semantics. In such a semantics, the only values
that we'll directly manipulate are small values such as integers
(i), unit (), and pointers (p). Thus, larger values (e.g., tuples)
must be allocated and referenced by pointer.
As we saw earlier, to build a "tail-recursive" interpreter for
such a language, we'll need to use a control-stack of some sort.
This allows us to avoid re-computing the evaluation context
over-and-over again. In addition, to avoid the expensive operation of
substituting a value for a variable within some code, we'll be using
an *environment-based* semantics. Environments will be represented
using tuples and a variable will be treated as an environment lookup.
With this informal overview in mind, here are the syntactic constructs
of our abstract machine:
(configurations) M ::= (H,S,v,e)
(heaps) H ::= {p1=h1,...,pn=hn}
(heap values) h ::= (v1,v2) | [\e,v]
(small values) v ::= i | p | ()
(expressions) e ::= var(i) | v | (e1,e2) | #i e | \e | e1 e2
(stacks) S ::= nil | (F,v)::S
(frames) F ::= ([],e) | (v,[]) | #i [] | [] e | v []
So a machine configuration (M) consists of a heap, a stack, a value
(representing the current environment) and an expression to execute.
Heaps are partial functions from pointers to heap-values. Heap-values
are either pairs of small-values, or a closure consisting of a lambda
and its environment (itself a small value.) In practice,
lambda-expressions will be represented by reference to a piece of code
(i.e., yet another kind of pointer) so notice that all heap values are
really pairs of small values (of known size -- we're assuming machine
integers here, not bignums.)
Small values include integers, unit, and pointers. Expressions are as
sketched above, but it is convenient to add closures [\e,v] so that
they include all heap-values. Stacks are lists of frames coupled with
an environment value. The frames record what to do with a value once
it's computed. Note that in something like ML, we would write:
datatype frame = LeftPair of exp | RightPair of value |
Proj of int | LeftApp of exp | RightApp of value
so frames are themselves simple data structures that include a tag
and either a reference to an expression or a small-value. As we will
see, it's crucial that we record the environment that was active
when we push a frame on the stack.
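Extending that datatype idea to the whole machine state, here is one way to sketch the syntax as runnable code. I'm using Python rather than ML so the sketch is easy to execute; the tags and names are my own.

```python
# Every syntactic form is a tuple whose first field is a tag; pointers are ints.

VAR, INT, PTR, UNIT = "var", "int", "ptr", "unit"        # var(i) and small values
PAIR, PROJ, LAM, APP = "pair", "proj", "lam", "app"      # compound expressions
HPAIR, CLOS = "hpair", "clos"                            # heap values (v1,v2), [\e,v]
LEFTPAIR, RIGHTPAIR, PROJF, LEFTAPP, RIGHTAPP = \
    "leftpair", "rightpair", "projf", "leftapp", "rightapp"   # the five frames

# Sample configuration pieces (H,S,v,e):
heap  = {0: (HPAIR, (INT, 1), (UNIT,))}     # H = {p0 = (1,())}
stack = [((PROJF, 1), (PTR, 0))]            # one frame with its saved environment
env   = (PTR, 0)                            # v: the current environment
expr  = (APP, (LAM, (VAR, 0)), (INT, 42))   # e = (\e) 42 in de Bruijn form
```

Representing the heap as a dict and the stack as a list of (frame, environment) pairs matches the grammar directly: a frame is a tag plus either an expression or a small value.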
Now we can phrase the rewriting rules for this abstract machine as
below. First, we deal with variables:
H(p) = (v1,v2)
----------------------------
(H,S,p,var(0)) -> (H,S,p,v1)
H(p) = (v1,v2)
-----------------------------------
(H,S,p,var(i+1)) -> (H,S,v2,var(i))
The first rule says that the environment must be a pointer p bound
in the heap to a tuple (v1,v2) and var(0) returns v1 as its result.
That is, the 0th variable is at the head of the list represented by
the environment pointer p. For var(i+1), we simply look at the tail
of the list and extract var(i) from it.
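Read operationally, the two rules are just a list traversal. A small Python sketch, with pointers as ints and syntax as tagged tuples (names are my own):

```python
# lookup walks i tail-pointers through the heap-allocated environment list
# and returns the head it lands on -- exactly the two var rules above.

def lookup(heap, env, i):
    tag, p = env                    # the environment must be a pointer ...
    assert tag == "ptr"
    _, v1, v2 = heap[p]             # ... bound in the heap to a pair (v1,v2)
    return v1 if i == 0 else lookup(heap, v2, i - 1)

# Environment list: p0 -> (10, p1), p1 -> (20, ()).
heap = {0: ("hpair", ("int", 10), ("ptr", 1)),
        1: ("hpair", ("int", 20), ("unit",))}
```

Note that looking up var(i) costs i steps, which is one of the intensional costs this machine makes visible.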
The next rule says we allocate heap-values:
(H,S,v,h) -> (H+{p=h},S,v,p)   (p not in Dom(H))
So for instance we have (H,S,v,(1,2)) -> (H+{p=(1,2)},S,v,p). At this
point it's worth remarking that we consider a heap H =
{p1=h1,...,pn=hn} to bind all of the free occurrences of the pointers
pi within the machine configuration and take configurations to be
equivalent up to alpha-conversion of pointers. Thus, the abstract
machine can always pick a "fresh" pointer p when allocating a
heap-value. (We will formalize free and binding occurrences of
pointers below.)
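In an implementation, the "fresh" pointer is simply any pointer outside Dom(H). A Python sketch (the heap is a dict from ints to heap values; names are my own):

```python
# The allocation rule (H,S,v,h) -> (H+{p=h},S,v,p): because configurations
# are identified up to renaming of pointers, any p not in Dom(H) will do.

def alloc(heap, hval):
    p = max(heap, default=-1) + 1   # a pointer outside Dom(H)
    new_heap = dict(heap)           # extend H to H+{p=h}
    new_heap[p] = hval
    return p, new_heap

# (H,S,v,(1,2)) -> (H+{p=(1,2)},S,v,p):
p, heap = alloc({}, ("hpair", ("int", 1), ("int", 2)))
```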
Next we deal with all of the congruences which are responsible for
pushing frames on the stack and evaluating a sub-expression:
(H,S,v,(v1,e2)) -> (H,((v1,[]),v)::S,v,e2) (e2 not a small value)
(H,S,v,(e1,e2)) -> (H,(([],e2),v)::S,v,e1) (e1 not a small value)
(H,S,v,v1 e2) -> (H,(v1 [],v)::S,v,e2) (e2 not a small value)
(H,S,v,e1 e2) -> (H,([] e2,v)::S,v,e1) (e1 not a small value)
(H,S,v,#i e) -> (H,(#i [],v)::S,v,e) (e not a small value)
In each of these cases, we're pushing a frame coupled with the current
environment value v. We need to record what the environment was because
rules, such as the variable rules, may end up throwing information away
regarding the environment. (You may be tempted to avoid this overhead,
so we'll come back to this later.) One thing to note here is that
there is a simple, constant-time test to tell if an expression is or
isn't a small-value. Note that with our earlier stack machine, we may
have to crawl over a large value (e.g., ((v1,(v2,v3)),(v4,v5))) to
determine this.
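The side conditions above only ever inspect the outermost constructor, which is why the test is constant-time. In a tagged-tuple representation (tags are my own):

```python
# Constant-time small-value test: look only at the tag, never at the subterms.

def is_small(e):
    return e[0] in ("int", "ptr", "unit")
```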
As before, when the machine gets down to a (small) value, we can
"return" the value by popping a frame from the stack of the abstract
machine:
(H,(F,v)::S,v',v1) -> (H,S,v,F[v1])
again, the constant-time check for being a small value helps the
situation. I'm going to define the good terminal configurations
for this machine to be of the form (H,nil,v,i) where i is an integer.
That is, well-typed programs will always produce integer results.
I'll say that eval_R(M,i) holds if M -R->* (H,nil,v,i) for some
evaluation relation R, and that eval_R(M,_|_) holds if there
exists an infinite sequence M1,M2,M3,... s.t. M -R-> M1 -R-> M2 -R-> M3
-R-> ...
The reduction of a projection operation is relatively straightforward:
H(p) = (v1,v2)
--------------------------
(H,S,v,#i p) -> (H,S,v,vi)
Now we are left with lambda-expressions and application. Note that
a lambda-expression is not itself a heap-value. Rather, we must turn
the lambda into a closure. The way to think about this machine is that
we're being lazy about substituting the environment within the
expression: rather than substituting eagerly, we interleave
substitution of the environment with evaluation. Of course, when we
get to a lambda, we stop evaluating,
but we can't stop doing the substitution! So we must remember whenever
we invoke the lambda to continue with the substitution that was in
effect at the time we created the closure, which is generally not the
environment at the point where the function is invoked. With these
facts in mind, the rules for lambda and application are as follows:
(H,S,v,\e) -> (H+{p=[\e,v]},S,v,p)
Finally, the rule for application is as follows:
H(p) = [\e,v2]
---------------------------------------
(H,S,v,p v1) -> (H+{p'=(v1,v2)},S,p',e)
Here, p must be bound in the heap to a closure [\e,v2]. We step to a
configuration where we are evaluating e. But we also change
environments. The new environment p' points to a list where the head
is the argument to the function (v1) and the tail is the old
environment that was packed up with the lambda to make the closure
(v2).
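With all of the rules in hand, it may help to see the machine as runnable code. Here is a self-contained Python sketch: pointers are ints, syntax is tagged tuples, the heap is a dict, and the stack is a list of (frame, saved-environment) pairs. All names are my own; this is an illustration of the rules, not a serious implementation.

```python
def is_small(e):
    return e[0] in ("int", "ptr", "unit")

def alloc(heap, hval):
    p = max(heap, default=-1) + 1          # any pointer outside Dom(H)
    h = dict(heap)
    h[p] = hval
    return p, h

def step(heap, stack, env, e):
    tag = e[0]
    if tag == "var":                       # the two variable rules
        _, v1, v2 = heap[env[1]]
        if e[1] == 0:
            return heap, stack, env, v1
        return heap, stack, v2, ("var", e[1] - 1)
    if tag == "lam":                       # allocate a closure [\e,v]
        p, h = alloc(heap, ("clos", e[1], env))
        return h, stack, env, ("ptr", p)
    if tag == "pair":
        e1, e2 = e[1], e[2]
        if is_small(e1) and is_small(e2):  # allocate a pair
            p, h = alloc(heap, ("hpair", e1, e2))
            return h, stack, env, ("ptr", p)
        if is_small(e1):                   # push (v1,[]) with the current env
            return heap, [(("rightpair", e1), env)] + stack, env, e2
        return heap, [(("leftpair", e2), env)] + stack, env, e1
    if tag == "proj":
        i, e1 = e[1], e[2]
        if is_small(e1):                   # H(p) = (v1,v2) gives #i p -> vi
            _, v1, v2 = heap[e1[1]]
            return heap, stack, env, v1 if i == 1 else v2
        return heap, [(("projf", i), env)] + stack, env, e1
    if tag == "app":
        e1, e2 = e[1], e[2]
        if is_small(e1) and is_small(e2):  # the application rule
            _, body, v2 = heap[e1[1]]      # H(p) must be a closure
            p, h = alloc(heap, ("hpair", e2, v2))
            return h, stack, ("ptr", p), body
        if is_small(e1):
            return heap, [(("rightapp", e1), env)] + stack, env, e2
        return heap, [(("leftapp", e2), env)] + stack, env, e1
    # e is a small value: pop a frame and plug e into its hole
    (f, v), rest = stack[0], stack[1:]
    if f[0] == "leftpair":
        return heap, rest, v, ("pair", e, f[1])
    if f[0] == "rightpair":
        return heap, rest, v, ("pair", f[1], e)
    if f[0] == "projf":
        return heap, rest, v, ("proj", f[1], e)
    if f[0] == "leftapp":
        return heap, rest, v, ("app", e, f[1])
    return heap, rest, v, ("app", f[1], e)     # rightapp

def run(heap, stack, env, e):
    while not (e[0] == "int" and stack == []): # good terminal configurations
        heap, stack, env, e = step(heap, stack, env, e)
    return e[1]
```

For example, run({}, [], ("unit",), ("app", ("lam", ("var", 0)), ("int", 42))) allocates a closure, fires the application rule, looks up var(0), and terminates with 42. Note that every branch of step does a constant amount of work apart from the copying inside alloc, which a real implementation avoids.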
One of the things that I like about this abstract machine is that
it is relatively faithful to the actual time it takes for a program
to evaluate. In particular, none of the operations, with the exception
of allocation, really requires more than constant time to implement
on real machines. In this respect, the abstract machine that we've
given could be used to calculate big-O running times of programs.
Of course, ignoring allocation is probably a bad idea...
Again, it is worth remarking that we don't need to save anything on
the stack when we do a function call with this abstract machine---this
is crucial for getting loops to run in constant stack space. Indeed,
one is tempted to take this abstract machine as *the* definition of a
language like Scheme where we would like to dictate to implementors
that they must use constant stack space for iterative procedures.
But of course, this doesn't work. Consider an implementor that
first CPS-converts the program. Then we won't be doing *any*
stack allocation, and yet we satisfy the "letter" of the law. That's
because all of the stack-frames will be represented using closures
in the heap!
Really, we must take garbage collection into account. But how do
we do this? One option is to add a new rewriting rule to the
language:
FP(H1,S,v,e) = {}
gc --------------------------
(H1+H2,S,v,e) => (H1,S,v,e)
This gc rule says that we can eliminate a portion of the heap (H2)
as long as the resulting program stays closed---that is, it has
no dangling references to H2. Formally we define:
FP(H,S,v,e) = (FP(H) + FP(S) + FP(v) + FP(e)) \ Dom(H)
Dom({p1=h1,...,pn=hn}) = {p1,...,pn}
FP({p1=h1,...,pn=hn}) = (FP(h1) + ... + FP(hn)) \ {p1,...,pn}
FP(nil) = {}
FP((F,v)::S) = FP(F) + FP(v) + FP(S)
FP(([],e)) = FP([] e) = FP(e)
FP((v,[])) = FP(v []) = FP(v)
FP(#i []) = {}
FP([\e,v]) = FP(e) + FP(v)
FP(var(i)) = {}
FP(i) = {}
FP(()) = {}
FP(p) = {p}
FP((e1,e2)) = FP(e1) + FP(e2)
FP(#i e) = FP(e)
FP(\e) = FP(e)
FP(e1 e2) = FP(e1) + FP(e2)
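These equations transcribe directly into code. A Python sketch, with pointers as ints, syntax as tagged tuples, and pointer sets as Python sets (names are my own):

```python
def fp_exp(e):
    tag = e[0]
    if tag == "ptr":                 # FP(p) = {p}
        return {e[1]}
    if tag in ("pair", "app"):       # FP((e1,e2)), FP(e1 e2)
        return fp_exp(e[1]) | fp_exp(e[2])
    if tag == "proj":                # FP(#i e) = FP(e)
        return fp_exp(e[2])
    if tag == "lam":                 # FP(\e) = FP(e)
        return fp_exp(e[1])
    return set()                     # var(i), i, () mention no pointers

def fp_hval(h):                      # FP((v1,v2)) and FP([\e,v]) alike
    return fp_exp(h[1]) | fp_exp(h[2])

def fp_frame(f):                     # a frame holds an expression or an int
    return set() if f[0] == "projf" else fp_exp(f[1])

def fp_heap(heap):                   # (FP(h1) + ... + FP(hn)) \ Dom(H)
    fps = set()
    for h in heap.values():
        fps |= fp_hval(h)
    return fps - set(heap)

def fp_config(heap, stack, env, e):  # FP(H,S,v,e)
    fps = fp_heap(heap) | fp_exp(env) | fp_exp(e)
    for (f, v) in stack:
        fps |= fp_frame(f) | fp_exp(v)
    return fps - set(heap)           # ... \ Dom(H)
```

A configuration is closed exactly when fp_config returns the empty set, which is the premise of the gc rule above.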
We know that informally, the gc rule is justified since this is
what tracing garbage collectors do. But how do we formalize this?
What should a gc be allowed to do? One idea is that we could say
a rewriting rule (H,S,v,e) => (H',S',v',e') is *safe* if adding
it doesn't change the possible behaviors of any program. Formally,
A rewriting relation R2 is safe with respect to R1 if for
any configuration M and any answer ans (an integer or bottom),
eval_R1(M,ans) <==> eval_(R1+R2)(M,ans).
Unfortunately, the gc rule as stated above is *not* safe with
respect to our evaluation relation. The reason is a small
technical one---we can loop forever doing gc's on any program.
We can fix this by defining an evaluation relation that forbids
GC's from occurring one after the other.
How do we prove that adding the gc rule is now safe? Two
lemmas help with this:
[GC postponement]: If M1 =gc=> M2 and M2 -> M3, then there
exists an M4 s.t. M1 -> M4 =gc=> M3.
[GC fusion]: If M1 =gc=> M2 =gc=> M3 then M1 =gc=> M3.
The postponement lemma says that gc commutes with evaluation. So the
intuition is that we can take any (finite) evaluation sequence which
interleaves evaluation with gc and argue that all of the gc steps can
be postponed until the end and fused into one big gc-step. Since gc
doesn't change
the answer of the final configuration, it must result in the same
answer.
Now the great thing about our gc rule is that it can be applied
non-deterministically without affecting the meaning of the program.
The bad thing about the gc rule is that it is far from a constant-
time "step" in the evaluation of our abstract machine. Furthermore,
the non-determinism makes it hard to say just how much space or time
we'll take, because we can effectively trade space for gc steps.
One can formalize a small-step rewriting relation for gc (see
Morrisett, Felleisen, & Harper, FPCA'95) to make those steps explicit
for something like a copying-collector, which will reveal that the
time to do a gc is roughly proportional to the stack and the part of
the heap that is preserved. One could also imagine interleaving
these small steps with the evaluation steps (plus some extra state)
to get a form of incremental collection.
Of course, all of this makes it very hard to say just how much time or
space a given program will take. We have a very operational model of
this, but we can only really find out answers by "running" the whole
program (on ground inputs). I find this very dissatisfying, but it
does point out that in a modern language, you can't really trade
space for time (at least if you're using a copying collector): some
work proportional to the live data is done just to support allocation.
And that work seems to grow with the live data. It would be nice if
we could come up with a way of specifying a model that allows us to
optimize the space-time product of a given algorithm. In my opinion
this is still a big open problem.
Another thing that I find interesting is that the definition of
a "safe" rewriting relation above really supports a *semantic*
notion of garbage collection. It allows us to do all sorts of
crazy things like deallocating any object that isn't accessed
in the future. It also allows us to rewrite code (i.e., online
optimization) or to compress/rewrite the stack, as long as we
get the same behavior out of the code.
The idea of semantic garbage is intriguing because there are in
fact algorithms that go beyond tracing to reclaim some reachable
objects which, nonetheless, aren't used in the future. A good
example of this is Baker's unification-based algorithm (rediscovered
by Appel, and a few other folks.) To understand this, we need
to see the typing rules for the abstract machine. The key rules
are these:
|- H : P P |- S : t'->int P |- v : G P;G |- e : t'
--------------------------------------------------------
|- (H,S,v,e) : int
All p in H. P;0 |- H(p) : P(p)
------------------------------
|- H : P
The key here is the notion of heap typing: We say P describes H
if whenever p=h is in the heap, P(p) is a type describing h.
P is the "interface" between the heap and the rest of the machine
including the stack, the environment, and the expression being
evaluated. It turns out that if a given pointer is unreachable,
we can assign it any type we like. Furthermore, we only need
to assign a pointer a type that is consistent with how that pointer
may be used by the stack, environment, and expression. If,
for instance, p : (int*int) * (int*int) but the only free occurrence
of p in the program is in e and is of the form #1 p, then we can
assign p the type (int*int) * Top. If there's another use of p
in the program that, say, performs #1 (#2 p) + #2 (#2 p), then
we can't assign p this type. So the idea behind Baker's algorithm
is to crawl around the expression e, the stack, and the environment
and try to assign each pointer the least constrained type we possibly
can. In essence, we use ML-style type-checking during GC to
come up with constraints on pointers and try to find the least
specialized type that is consistent with those constraints. If
we ever end up assigning a pointer the type Top, then we know the
contents of that pointer will never be dereferenced, so we can
safely deallocate the pointer.
In practice, Baker's algorithm has proven too hard to implement
and get real speedups. But it does suggest that we can do a lot
more in terms of deallocation. As another example, the region
inference work of Tofte & Talpin figures out that some values
aren't needed in the future based on effect information. Still
another example is that we could calculate when variables are
live/dead and use this to avoid preserving values (in practice,
this is crucial for minimizing leaks in a real language.) All
of these tricks are aiming at closing in on semantic garbage.