next up previous
Next: About this document ... Up: IBAL Tutorial Previous: Learning

Decision Making

In addition to probabilistic models, IBAL allows the description of decision problems, which can then be solved using decision theory. A decision problem consists of a set of choices available to an agent, a probability distribution over different outcomes for each of the choices, and utilities achieved by the agent under different outcomes. The decision problem also specifies the information available to the agent.

Decision problems fit naturally into IBAL's generative language. The key difference is that a decision variable has no associated expression defining how it is generated. Instead, we assume that its value will be chosen so as to maximize the agent's expected utility. It is up to IBAL's inference engine to figure out what that value is!

Let us illustrate with a series of examples. The first example, in speed1.ibl describes a situation where a driver is late for an appointment, and must decide whether or not to speed. Speeding increases the likelihood of arriving on time, but runs the risk of a speeding ticket. The decision variable is defined using the declaration

decide speed
This simple declaration says that speed is a Boolean variable whose value will be chosen so as to maximize expected utility. A decision variable can also be a Symbol, declared with syntax such as
decide speed from 'yes, 'no
or an Integer, declared by
decide speed from 50 .. 90

The program speed1.ibl goes on to specify the consequences of speeding: the likelihood of arriving on time and getting a ticket. In order to complete the decision problem, we must specify the utility associated with each possible outcome. This is done with the following code:

reward case ticket of
  # true : -100.0
  # false : 0.0
reward case on_time of
  # true : 10.0
  # false : -10.0
The reward case form has similar syntax to a case expression, except that the item to the right of the colon is a floating point utility. In our example, there are actually two reward case declarations. The different declarations are additive; the total utility is the sum of those arising from different declarations. For example, the utility for getting a ticket and arriving late is -110.

We can query IBAL for the optimal decision in this example by asking for speed. IBAL indicates that the correct decision is for the driver to speed with probability 1. Furthermore, IBAL reports an expected utility to the driver of -0.38. If we also query ticket, IBAL's solution has two rows, one for each value of ticket. In addition to the probabilities of each of the rows, IBAL reports a utility with each row. This is a conditional expected utility, conditioned on the variables taking on the values specified in the row.

In the previous example, the driver had no information on which to make the decision. We can also imagine a case in which the driver has the opportunity to spot a cop before making the decision. This is described in speed2.ibl. There is now a variable see_cop, describing the driver's noisy cop sensor. The decision variable is now declared using the syntax

decide speed given see_cop
If we now query see_cop and speed, we see that the optimal decision depends on whether or not the driver sees a cop, in the natural way. IBAL provides the probability of each case happening, and the expected utility to the driver in each case.

In many decision problems, the agent gets to make a sequence of decisions. Information received as a result of earlier decisions is available to later decisions. File speed3.ibl shows such an example. Here the agent can look for a cop, increasing the probability of seeing one, before actually deciding whether or not to speed. Looking for a cop has a small cost associated with it. Although look_for_cop is not listed as a given piece of information for speed, IBAL makes the standard ``no forgetting'' assumption that all previous decisions, and all information available to them, are available information when a future decision is made. Thus the speed decision can take both the look_for_cop decision and the value of see_cop into account. To solve a sequential decision problem in this form, IBAL uses the backward induction algorithm. First the speed decision is worked out, for all possible value of look_for_cop and see_cop. Once that has been done, speed can be turned into an ordinary variable, and the look_for_cop decision can be solved.

Sequential decision problems can be naturally stated in a recursive way. This forms the basis for the Markov Decision Process (MDP) framework. File speed4.ibl shows an MDP built from our example. There are two possible states of the world: the driver may be 'early or 'late. The mdp function takes a time horizon n and the current state as arguments. The driver will eventually receive a positive or negative reward at horizon 0, depending on being early or late. Otherwise, if the horizon is greater than 0, the driver has an opportunity to speed. Once again, the driver can look for a cop and base the decision to speed on whether or not a cop is seen. Looking for a cop has a small cost, and getting a speeding ticket has a large one, but speeding increases the likelihood of moving to the 'early state at the next time step.

There are a number of syntactic features worth discussing in this example. First, don't be alarmed by the inconsistent use of semicolons after declarations -- they are optional. Second, note the way the reward for arriving early is stated:

reward 10.0 in error
Previously, we saw rewards in declarations, but rewards, as well as other declarations can take an expression form. reward r in e means the same as expression e, except that a utility of r is added to the outcome. In our example, the expression error is used because the mdp function as a whole returns some values that only exist at a horizon greater than 0. This expression only exists to specify the rewards; we will never actially be interested in the result value. Third, it is possible to discount the utility of an expression using the discount d in e. This has the same meaning as expression e, except that the utility of e is multiplied by discount factor d.

Technically speaking, our example is not really a fully observed MDP since the presence of a cop is unknown, and the agent has the ability to gather information. However, the unknown information does not persist from one time step to the next. The presence of a cop at one time point does not affect the presence of one at the next time point. What if we wanted a more complex model in which the unobserved information did persist? Unfortunately there is no way to do that in IBAL at the moment. IBAL makes the restriction that if a decision is declared inside a function, then all arguments to the function are known to the decision maker. (This explains why we didn't have to write given state in the declaration of the decisions.) This restriction is necessary in order to be able to solve functions in a modular way. Unfortunately, it makes it impossible to express general partially observable MDPs.

When we query the program, it returns the optimal policy with n steps to go starting from a given state. IBAL performs this computation in time linear in n. The solution is in the form of a triple: the look_for_cop decision, the see_cop information, and the speed decision. The returned solution will generally have two rows, for the two possible states of see_cop. Each row will have the same value for look_for_cop, because that decision is taken before the see_cop observation. The row will then indicate the speed decision for the given value of see_cop.

One could have imagined having the mdp function return a full-fledged history rather than simply the policy for the first step. One might then write a function to extract relevant portions of the history. Unfortunately, such an approach would have been computationally infeasible. IBAL's ability to solve MDPs in linear time relies on its ability to eliminate each step after it has finished solving it. Making the mdp function return a history would have made it impossible to eliminate individual time steps. The result is that IBAL tries to compute, as an intermediate step, a probability distribution over histories, whose size is exponential in the history.

For technical reasons, the decision making component of IBAL is only available with lazy evaluation.


next up previous
Next: About this document ... Up: IBAL Tutorial Previous: Learning
Avi Pfeffer 2006-11-19