CS121
Introduction to Artificial Intelligence
Winter 2004
Homework #6
(Markov Decision Processes)
Out:
How to
complete this HW: This homework is a programming assignment to be done
individually. You can download the C
starter files from /afs/ir.stanford.edu/class/cs121/hw6. After writing the program, you should submit it
using the submission script at /afs/ir.stanford.edu/class/cs121/bin/submit6.
You must submit before 3/1 at
Note on Honor
Code: You
may discuss this program with other students in the class and/or consult
other sources (books, web sites), with the exception of published solutions,
provided that you do not take any written or electronic notes.
However, all code and answers must be your own work. If you have
discussed the program with other students, indicate their name(s) in the
README. Any intentional transgression of these rules will be considered an
honor code violation.
General Information
This programming assignment will allow you to
experiment with the value iteration algorithm to calculate utilities and
optimal policies for Markov decision processes (MDPs).
This assignment will also allow you to test the policy iteration algorithm. The
project is based on the 4 × 3 environment described in Russell & Norvig (2nd ed.) on pages 613-621. While reading the book
might be helpful in understanding this project, all of the necessary
information is contained in this handout.
Problem Description
Suppose our agent is a mouse that is situated in the
4 × 3 environment shown in Fig. 1. Beginning at some initial state, the mouse
must choose one of the available actions, namely, up, down, left, and right.
These actions allow the mouse to transition between cells. Two of the cells are
terminal states. One of the terminal cells (marked by +1) contains cheese. If
the mouse enters the cell with cheese, the mouse will get the cheese, receive a
reward of +1, and go to heaven. The other terminal cell is marked by -1. If the
mouse enters this cell, it falls down a hole never to escape and receives a
reward of -1. The reward for being in any of the other cells is -0.04.
+------+------+------+------+
|      |      |      |  +1  |
+------+------+------+------+
|      |  X   |      |  -1  |
+------+------+------+------+
|      |      |      |      |
+------+------+------+------+
Figure 1: Our 4 × 3 mouse world environment.
We shall assume that the environment is completely
observable to our mouse. However, because the floor is slippery
and our mouse is wearing socks, we cannot assume that the mouse's actions will
always have the desired effect. In particular, an action will have
the intended result only 80% of the time. The remaining 20% of the time, the
mouse will slip in a direction perpendicular to its intended direction (10% to each side).
mouse intentionally or unintentionally walks into a wall, the effect will be that
the mouse remains in the same cell. The mouse may not enter the X cell. There
are eleven states in this problem, s0, …, s10, each corresponding to the mouse
inhabiting one of the eleven cells. We will
number the cells from left to right, bottom to top. The objective is to develop
the optimal policy specifying which action to take from each state to maximize
the expected utility.
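For example, under this numbering s0 is the bottom-left cell, s1 is the cell to
its right, and s4 is the cell directly above s0 (the X cell is skipped). Taking
the action up from s0 then gives
T(s0, up, s4) = 0.8 (the intended move),
T(s0, up, s1) = 0.1 (slip right), and
T(s0, up, s0) = 0.1 (slip left into the wall, so the mouse stays put).
This is only an illustration of how the transition model is read; check the
provided mousemap.transition file for the exact encoding.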
Value Iteration
In the previous section, we informally specified the
transition model T(s, a, s′), which is the probability of reaching state s′ after
taking action a in state s. We also specified the reward function R(s). The
transition model and reward function define an MDP, and we may use the
Value-Iteration algorithm to approximate the utility function U(s) over all the
states with a discount factor γ and an error bound ε.
The Value-Iteration algorithm appears on page 621 of
the text. Our algorithm takes as input the environment, the transition
model, the reward function, the discount factor γ, and an error bound ε used to
decide when the algorithm has converged. Formally:
Value-Iteration(S, T, R, γ, ε)
  for all s do: U′(s) = 0
  repeat
    U = U′
    δ = 0
    for all s do:
      U′(s) = R(s) + γ · max_a Σ_s′ T(s, a, s′) U(s′)
      δ = max{δ, |U′(s) − U(s)|}
  until δ ≤ ε(1 − γ)/γ
  return U
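As a rough sketch of how this might look in C, the loop below follows the pseudocode directly. The names NUM_STATES, NUM_ACTIONS, reward, and transition are assumptions based on the representation described in the Getting Started section below; check mdp.c for the actual declarations and adapt accordingly.

#include <math.h>

#define NUM_STATES   11
#define NUM_ACTIONS  4
#define MAX_ITERS    10000   /* safety cap against infinite loops */

/* Assumed globals, filled in by the starter code from mousemap.reward and
   mousemap.transition; check mdp.c for the real declarations. */
extern double reward[NUM_STATES];
extern double transition[NUM_STATES][NUM_ACTIONS][NUM_STATES];

/* Approximate U(s) for every state, given discount gamma and error bound epsilon. */
void value_iteration(double U[NUM_STATES], double gamma, double epsilon)
{
    double Unew[NUM_STATES] = {0.0};    /* U' in the pseudocode */
    int iter, s, a, s2;

    for (iter = 0; iter < MAX_ITERS; iter++) {
        double delta = 0.0;

        for (s = 0; s < NUM_STATES; s++)
            U[s] = Unew[s];                          /* U <- U' */

        for (s = 0; s < NUM_STATES; s++) {
            double best = -HUGE_VAL;
            for (a = 0; a < NUM_ACTIONS; a++) {      /* max over actions */
                double expected = 0.0;
                for (s2 = 0; s2 < NUM_STATES; s2++)
                    expected += transition[s][a][s2] * U[s2];
                if (expected > best)
                    best = expected;
            }
            Unew[s] = reward[s] + gamma * best;      /* Bellman update */
            if (fabs(Unew[s] - U[s]) > delta)
                delta = fabs(Unew[s] - U[s]);
        }

        if (delta <= epsilon * (1.0 - gamma) / gamma)
            break;                                   /* converged */
    }
}

The MAX_ITERS cap is only a guard against bugs, as suggested later in this handout; a correct implementation converges well before hitting it.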
Once we calculate the utility of each state, we may
then determine the optimal policy π* according to the following formula:
π*(s) = arg max_a Σ_s′ T(s, a, s′) U(s′)
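In C, this is just an argmax over the four actions for each state, for instance (reusing the same assumed declarations as the sketch above):

/* Pick, for each state, the action with the highest expected utility under U.
   Actions are assumed to be numbered 0=up, 1=down, 2=left, 3=right as
   described in Getting Started. */
void extract_policy(const double U[NUM_STATES], int policy[NUM_STATES])
{
    int s, a, s2;
    for (s = 0; s < NUM_STATES; s++) {
        int best_a = 0;
        double best = -HUGE_VAL;
        for (a = 0; a < NUM_ACTIONS; a++) {
            double expected = 0.0;
            for (s2 = 0; s2 < NUM_STATES; s2++)
                expected += transition[s][a][s2] * U[s2];
            if (expected > best) {
                best = expected;
                best_a = a;
            }
        }
        policy[s] = best_a;
    }
}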
Policy Iteration
Policy-Iteration appears on pages 624-625 of the
text. We will implement the modified Policy-Iteration algorithm. The modified Policy-Iteration algorithm
approximates the utility function given a policy by iterative computation
instead of solving for the exact values using a linear system solver. You should experiment to determine the number
of iterations, K, that is needed to approximate the
exact values. It takes as input the
environment, the transition model, the reward function, and the discount factor
γ, and it outputs a policy. Formally:
Policy-Iteration(S, T, R, γ)
  pick a policy π′ at random
  repeat
    π = π′
    repeat K times:
      for all s do: U′(s) = R(s) + γ · Σ_s′ T(s, π(s), s′) U(s′)
      U = U′
    for all s do: π′(s) = arg max_a Σ_s′ T(s, a, s′) U(s′)
  until π′ = π
  return π
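A C sketch of this, again under the assumed representation from the earlier sketches (the random initial policy and the choice of K are placeholders to experiment with):

#include <math.h>     /* HUGE_VAL */
#include <stdlib.h>   /* rand */
#include <string.h>   /* memcpy */

/* Modified policy iteration: evaluate the current policy with K Bellman
   updates, then improve it greedily; stop once the policy is stable. */
void policy_iteration(int policy[NUM_STATES], double gamma, int K)
{
    double U[NUM_STATES] = {0.0}, Unew[NUM_STATES];
    int changed = 1;
    int s, a, s2, k;

    for (s = 0; s < NUM_STATES; s++)            /* random initial policy */
        policy[s] = rand() % NUM_ACTIONS;

    while (changed) {
        changed = 0;

        /* Approximate policy evaluation: K sweeps with the policy's action. */
        for (k = 0; k < K; k++) {
            for (s = 0; s < NUM_STATES; s++) {
                double expected = 0.0;
                for (s2 = 0; s2 < NUM_STATES; s2++)
                    expected += transition[s][policy[s]][s2] * U[s2];
                Unew[s] = reward[s] + gamma * expected;
            }
            memcpy(U, Unew, sizeof(U));          /* U <- U' */
        }

        /* Policy improvement: greedy action with respect to U. */
        for (s = 0; s < NUM_STATES; s++) {
            int best_a = policy[s];
            double best = -HUGE_VAL;
            for (a = 0; a < NUM_ACTIONS; a++) {
                double expected = 0.0;
                for (s2 = 0; s2 < NUM_STATES; s2++)
                    expected += transition[s][a][s2] * U[s2];
                if (expected > best) {
                    best = expected;
                    best_a = a;
                }
            }
            if (best_a != policy[s]) {
                policy[s] = best_a;
                changed = 1;
            }
        }
    }
}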
Getting Started
We have provided some C stub code as well as a Makefile; they are available at
/afs/ir.stanford.edu/class/cs121/hw6.
While it is your prerogative to do your development on any platform, it
is your responsibility to make sure that your solution works on a Sun Solaris machine (tree, saga, myth, elaine, fable, epic,
etc.), as we will be testing your programs on one of these.
You can compile the program and create the
executable by entering:
fable10:~> make
The name of the program is mdp.
The C starter code is available here:
/usr/class/cs121/hw6. The only file you should change is mdp.c.
Much of the coding has been done for you. You will need to add around 30 lines
of code to this file to implement the functions Value-Iteration and
Policy-Iteration.
Each state is represented by an integer from 0 to
10, numbered from left to right, bottom to top. The actions are represented by
integers 0 to 3, corresponding to up, down, left, and right respectively. The
reward function is represented by a 1-D array called reward, and the transition
model is represented by a 3-D array called transition. The values for these
arrays are read into memory by the starter code from the files mousemap.reward and mousemap.transition.
Please read the code in GetTransition and GetReward to see how these files work. Running the program
will print the results to the screen and create two files, value.out and policy.out,
containing the results.
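As a quick sanity check on your reading of these arrays, something like the following (assuming the transition[s][a][s′] index order described above; confirm the actual layout in GetTransition) verifies that the outgoing probabilities for each state-action pair sum to 1. The terminal states may be encoded specially in the provided file (for example, with no outgoing transitions), so do not be alarmed if they are reported.

#include <stdio.h>
#include <math.h>

/* Report any state-action pair whose outgoing probabilities do not sum to 1. */
void check_transition_model(void)
{
    int s, a, s2;
    for (s = 0; s < NUM_STATES; s++) {
        for (a = 0; a < NUM_ACTIONS; a++) {
            double total = 0.0;
            for (s2 = 0; s2 < NUM_STATES; s2++)
                total += transition[s][a][s2];
            if (fabs(total - 1.0) > 1e-6)
                printf("state %d, action %d: probabilities sum to %g\n",
                       s, a, total);
        }
    }
}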
You may wish to put a generously large limit on
the number of iterations in your program to avoid infinite loops due to
programming errors. A correctly
implemented program will not loop infinitely and will converge to a solution on
the provided transition and reward files.
Questions
Answer the following questions:
1. How does the discount factor affect the number of
iterations required for convergence in the two algorithms? For
Policy-Iteration, experiment with several random initial policies for each
discount and use the average number of iterations.
2. How does the optimal policy change when the
discount factor is increased from 0.9 to 1.0 in Value-Iteration?
3. Do an experiment where you change the reward
function (you will need to change the file mousemap.reward).
Describe your results with both algorithms.
4. For this part of the
problem, set γ = 0.9. Let p be the probability that an action has its intended effect.
The default mousemap.transition file has p = 0.8.
Run both algorithms while changing the value of p, but keep the
probability that the mouse slips in each perpendicular direction at (1-p)/2. You will
need to experiment with the mousemap.transition file
to find a value of p such that the optimal policy changes. What value of p did
you use? What are these new policies for each algorithm? Explain why changing p
changes the policy.
5. Discuss the differences between the results of
Policy-Iteration and Value-Iteration on the following: policies found and
utilities found.
Deliverables
Your program will be tested for correctness on
various maps. Correctness is worth 60 points. Correctness testing will be
done with scripts. Your program must compile and run on the specified grading
machines without us making any modifications.
If it does not, you will get no correctness points. 30 points will be
awarded for correctly answering the questions and 10 points will be awarded for
code quality. (Your code will need to be exceptionally badly designed to lose
these points, so don't stress too much about this.) In addition to the program code, you should
supply a README that includes your name, e-mail, SUNet
ID, answers to the questions, and any comments you might have.
To submit your assignment, clean out all object
files and executables from your directory, so that all you are submitting is
the Makefile, code, and README; then run the following submission script and
follow its instructions:
/afs/ir.stanford.edu/class/cs121/bin/submit6
Please only submit from inside your working
directory. Try not to submit sample input or output files.
A complete project will consist of the following:
Your answers to the questions.
Your source code (mdp.c), with all other files needed to correctly compile and execute your program.