In addition to probabilistic models, IBAL allows the description of decision problems, which can then be solved using decision theory. A decision problem consists of a set of choices available to an agent, a probability distribution over different outcomes for each of the choices, and utilities achieved by the agent under different outcomes. The decision problem also specifies the information available to the agent.
Decision problems fit naturally into IBAL's generative language. The key difference is that a decision variable has no associated expression defining how it is generated. Instead, we assume that its value will be chosen so as to maximize the agent's expected utility. It is up to IBAL's inference engine to figure out what that value is!
Let us illustrate with a series of examples. The first example, in
speed1.ibl describes a situation where a driver is late for an
appointment, and must decide whether or not to speed. Speeding
increases the likelihood of arriving on time, but runs the risk of a
speeding ticket. The decision variable is defined using the
declaration
decide speedThis simple declaration says that
speed is a Boolean variable
whose value will be chosen so as to maximize expected utility.
A decision variable can also be a Symbol, declared with syntax such as
decide speed from 'yes, 'noor an Integer, declared by
decide speed from 50 .. 90
The program speed1.ibl goes on to specify the consequences of
speeding: the likelihood of arriving on time and getting a ticket. In
order to complete the decision problem, we must specify the utility
associated with each possible outcome. This is done with the
following code:
reward case ticket of # true : -100.0 # false : 0.0 reward case on_time of # true : 10.0 # false : -10.0The
reward case form has similar syntax to a case
expression, except that the item to the right of the colon is a
floating point utility. In our example, there are actually two
reward case declarations. The different declarations are
additive; the total utility is the sum of those arising from different
declarations.
For example, the utility for getting a ticket and arriving late is
-110.
We can query IBAL for the optimal decision in this example by asking
for speed. IBAL indicates that the correct decision is for the
driver to speed with probability 1. Furthermore, IBAL reports an
expected utility to the driver of -0.38. If we also query
ticket, IBAL's solution has two rows, one for each value of
ticket. In addition to the probabilities of each of the rows,
IBAL reports a utility with each row. This is a conditional expected
utility, conditioned on the variables taking on the values specified
in the row.
In the previous example, the driver had no information on which to
make the decision. We can also imagine a case in which the driver has
the opportunity to spot a cop before making the decision. This is
described in speed2.ibl.
There is now a variable see_cop, describing the driver's noisy
cop sensor. The decision variable is now declared using the syntax
decide speed given see_copIf we now query
see_cop and speed, we see that the
optimal decision depends on whether or not the driver sees a cop, in
the natural way. IBAL provides the probability of each case
happening, and the expected utility to the driver in each case.
In many decision problems, the agent gets to make a sequence of
decisions. Information received as a result of earlier decisions is
available to later decisions. File speed3.ibl shows such an
example.
Here the agent can look for a cop, increasing the probability of
seeing one, before actually deciding whether or not to speed.
Looking for a cop has a small cost associated with it.
Although look_for_cop is not listed as a given piece of
information for speed, IBAL makes the standard ``no
forgetting'' assumption that all previous decisions, and all
information available to them, are available information when a future
decision is made.
Thus the speed decision can take both
the look_for_cop decision and the value of see_cop into
account.
To solve a sequential decision problem in this form, IBAL uses the
backward induction algorithm.
First the speed decision is worked out, for all possible value
of look_for_cop and see_cop. Once that has been done,
speed can be turned into an ordinary variable, and the
look_for_cop decision can be solved.
Sequential decision problems can be naturally stated in a recursive
way. This forms the basis for the Markov Decision Process (MDP)
framework.
File speed4.ibl shows an MDP built from our example.
There are two possible states of the world: the driver may be
'early or 'late.
The mdp function takes a time horizon n and the current
state as arguments.
The driver will eventually receive a positive or negative reward at
horizon 0, depending on being early or late. Otherwise, if the horizon
is greater than 0, the driver has an opportunity to speed. Once
again, the driver can look for a cop and base the decision to speed on
whether or not a cop is seen.
Looking for a cop has a small cost, and getting a speeding ticket has
a large one, but speeding increases the likelihood of moving to the
'early state at the next time step.
There are a number of syntactic features worth discussing in this example. First, don't be alarmed by the inconsistent use of semicolons after declarations -- they are optional. Second, note the way the reward for arriving early is stated:
reward 10.0 in errorPreviously, we saw rewards in declarations, but rewards, as well as other declarations can take an expression form.
reward r in e means the same as expression e, except
that a utility of r is added to the outcome.
In our example, the expression error is used because the
mdp function as a whole returns some values that only exist at
a horizon greater than 0. This expression only exists to specify the
rewards; we will never actially be interested in the result value.
Third, it is possible to discount the utility of an expression using
the discount d in e. This has the same meaning as expression
e, except that the utility of e is multiplied by
discount factor d.
Technically speaking, our example is not really a fully observed
MDP since the presence of a cop is unknown, and the agent has the
ability to gather information. However, the unknown information does
not persist from one time step to the next. The presence of a cop at
one time point does not affect the presence of one at the next time
point. What if we wanted a more complex model in which the unobserved
information did persist? Unfortunately there is no way to do that in
IBAL at the moment. IBAL makes the restriction that if a decision is
declared inside a function, then all arguments to the function are
known to the decision maker. (This explains why we didn't have to
write given state in the declaration of the decisions.)
This restriction is necessary in order to be able to solve functions
in a modular way. Unfortunately, it makes it impossible to express
general partially observable MDPs.
When we query the program, it returns the optimal policy with n
steps to go starting from a given state.
IBAL performs this computation in time linear in n.
The solution is in the form of a triple: the look_for_cop
decision, the see_cop information, and the speed
decision.
The returned solution will generally have two rows, for the two
possible states of see_cop.
Each row will have the same value for look_for_cop, because
that decision is taken before the see_cop observation.
The row will then indicate the speed decision for the given
value of see_cop.
One could have imagined having the mdp function return a
full-fledged history rather than simply the policy for the first
step. One might then write a function to extract relevant portions of
the history. Unfortunately, such an approach would have been
computationally infeasible. IBAL's ability to solve MDPs in linear
time relies on its ability to eliminate each step after it has
finished solving it. Making the mdp function return a history
would have made it impossible to eliminate individual time steps.
The result is that IBAL tries to compute, as an intermediate step, a
probability distribution over histories, whose size is exponential in
the history.
For technical reasons, the decision making component of IBAL is only available with lazy evaluation.