CS121
Introduction to Artificial Intelligence
Winter 2004
Homework #6
(Markov Decision Processes)
Out:
How to
complete this HW: This homework is a programming assignment to be done
individually. You can download the C
starter files from /afs/ir.stanford.edu/class/cs121/hw6. After writing the program, you should submit it
using the submission script at /afs/ir.stanford.edu/class/cs121/bin/submit6.
You must submit before 3/1 at
Note on Honor
Code: You
may discuss this program with other students in the class and/or consult
other sources (books, web sites), with the exception of published solutions,
provided that you do not take any written or electronic notes.
However, all code and answers must be your own work. If you have
discussed the program with other students, indicate their name(s) in the
README. Any intentional transgression of these rules will be considered an
honor code violation.
General Information
This programming assignment will allow you to
experiment with the value iteration algorithm to calculate utilities and
optimal policies for Markov decision processes (MDPs).
This assignment will also allow you to test the policy iteration algorithm. The
project is based on the 4 × 3 environment described in Russell & Norvig (2nd ed.) on pages 613-621. While reading the book
might be helpful in understanding this project, all of the necessary
information is contained in this handout.
Problem Description
Suppose our agent is a mouse that is situated in the
4 × 3 environment shown in Fig. 1. Beginning at some initial state, the mouse
must choose one of the available actions, namely, up, down, left, and right.
These actions allow the mouse to transition between cells. Two of the cells are
terminal states. One of the terminal cells (marked by +1) contains cheese. If
the mouse enters the cell with cheese, the mouse will get the cheese, receive a
reward of +1, and go to heaven. The other terminal cell is marked by -1. If the
mouse enters this cell, it falls down a hole never to escape and receives a
reward of -1. The reward for being in any of the other cells is -0.04.
+------+------+------+------+
|      |      |      |  +1  |
+------+------+------+------+
|      |  X   |      |  -1  |
+------+------+------+------+
|      |      |      |      |
+------+------+------+------+
Figure 1: Our 4 × 3 mouse world environment.
We shall assume that the environment is completely
observable to our mouse. However, because the floor is slippery
and our mouse is wearing socks, we cannot assume that the mouse's actions will
always have the desired effect. In particular, an action will have
the intended result only 80% of the time. The remaining 20% of the time, the
mouse will slip in a direction perpendicular to its intended direction (10% to each side).
mouse intentionally or unintentionally walks into a wall, the effect will be that
the mouse remains in the same cell. The mouse may not enter the X cell. There
are eleven states in this problem, s0, …, s10, each corresponding to the mouse
inhabiting one of the eleven cells. We will
number the cells from left to right, bottom to top. The objective is to develop
the optimal policy specifying which action to take from each state to maximize
the expected utility.
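For example, under this numbering s0 is the bottom-left cell, s1 is the cell to
its right, and s4 is the cell directly above s0 (the X cell is skipped). Taking
the action up from s0 then gives
T(s0, up, s4) = 0.8 (the intended move),
T(s0, up, s1) = 0.1 (slip right), and
T(s0, up, s0) = 0.1 (slip left into the wall, so the mouse stays put).
This is only an illustration of how the transition model is read; check the
provided mousemap.transition file for the exact encoding.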
Value Iteration
In the previous section, we informally specified the
transition model T(s, a, s′), which is the probability of reaching state s′ after
taking action a in state s. We also specified the reward function R(s). The
transition model and reward function define an MDP, and we may use the
Value-Iteration algorithm to approximate the utility function U(s) over all the
states with a discount factor γ and an error bound ε.
The Value-Iteration algorithm appears on page 621 of
the text. Our algorithm takes as input the environment, the transition
model, the reward function, the discount factor γ, and an error bound ε used to
decide when the algorithm has converged. Formally:
Value-Iteration(S, T, R, γ, ε)
  for all s do: U′(s) = 0
  repeat
    U = U′
    δ = 0
    for all s do:
      U′(s) = R(s) + γ · max_a Σ_s′ T(s, a, s′) U(s′)
      δ = max{δ, |U′(s) − U(s)|}
  until δ ≤ ε(1 − γ)/γ
  return U
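As a rough sketch of how this might look in C, the loop below follows the pseudocode directly. The names NUM_STATES, NUM_ACTIONS, reward, and transition are assumptions based on the representation described in the Getting Started section below; check mdp.c for the actual declarations and adapt accordingly.

#include <math.h>

#define NUM_STATES   11
#define NUM_ACTIONS  4
#define MAX_ITERS    10000   /* safety cap against infinite loops */

/* Assumed globals, filled in by the starter code from mousemap.reward and
   mousemap.transition; check mdp.c for the real declarations. */
extern double reward[NUM_STATES];
extern double transition[NUM_STATES][NUM_ACTIONS][NUM_STATES];

/* Approximate U(s) for every state, given discount gamma and error bound epsilon. */
void value_iteration(double U[NUM_STATES], double gamma, double epsilon)
{
    double Unew[NUM_STATES] = {0.0};    /* U' in the pseudocode */
    int iter, s, a, s2;

    for (iter = 0; iter < MAX_ITERS; iter++) {
        double delta = 0.0;

        for (s = 0; s < NUM_STATES; s++)
            U[s] = Unew[s];                          /* U <- U' */

        for (s = 0; s < NUM_STATES; s++) {
            double best = -HUGE_VAL;
            for (a = 0; a < NUM_ACTIONS; a++) {      /* max over actions */
                double expected = 0.0;
                for (s2 = 0; s2 < NUM_STATES; s2++)
                    expected += transition[s][a][s2] * U[s2];
                if (expected > best)
                    best = expected;
            }
            Unew[s] = reward[s] + gamma * best;      /* Bellman update */
            if (fabs(Unew[s] - U[s]) > delta)
                delta = fabs(Unew[s] - U[s]);
        }

        if (delta <= epsilon * (1.0 - gamma) / gamma)
            break;                                   /* converged */
    }
}

The MAX_ITERS cap is only a guard against bugs, as suggested later in this handout; a correct implementation converges well before hitting it.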
Once we calculate the utility of each state, we may
then determine the optimal policy π* according to the following formula:
π*(s) = arg max_a Σ_s′ T(s, a, s′) U(s′)
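In C, this is just an argmax over the four actions for each state, for instance (reusing the same assumed declarations as the sketch above):

/* Pick, for each state, the action with the highest expected utility under U.
   Actions are assumed to be numbered 0=up, 1=down, 2=left, 3=right as
   described in Getting Started. */
void extract_policy(const double U[NUM_STATES], int policy[NUM_STATES])
{
    int s, a, s2;
    for (s = 0; s < NUM_STATES; s++) {
        int best_a = 0;
        double best = -HUGE_VAL;
        for (a = 0; a < NUM_ACTIONS; a++) {
            double expected = 0.0;
            for (s2 = 0; s2 < NUM_STATES; s2++)
                expected += transition[s][a][s2] * U[s2];
            if (expected > best) {
                best = expected;
                best_a = a;
            }
        }
        policy[s] = best_a;
    }
}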
Policy Iteration
Policy-Iteration appears on pages 624-625 of the
text. We will implement the modified Policy-Iteration algorithm. The modified Policy-Iteration algorithm
approximates the utility function given a policy by iterative computation
instead of solving for the exact values using a linear system solver. You should experiment to determine the number
of iterations, K, that is needed to approximate the
exact values. It takes as input the
environment, the transition model, the reward function, and the discount factor
γ, and it outputs a policy. Formally:
Policy-Iteration(S, T, R, γ)
  pick a policy π′ at random
  repeat
    π = π′
    repeat K times:
      for all s do: U′(s) = R(s) + γ · Σ_s′ T(s, π(s), s′) U(s′)
      U = U′
    for all s do: π′(s) = arg max_a Σ_s′ T(s, a, s′) U(s′)
  until π′ = π
  return π
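A C sketch of this, again under the assumed representation from the earlier sketches (the random initial policy and the choice of K are placeholders to experiment with):

#include <math.h>     /* HUGE_VAL */
#include <stdlib.h>   /* rand */
#include <string.h>   /* memcpy */

/* Modified policy iteration: evaluate the current policy with K Bellman
   updates, then improve it greedily; stop once the policy is stable. */
void policy_iteration(int policy[NUM_STATES], double gamma, int K)
{
    double U[NUM_STATES] = {0.0}, Unew[NUM_STATES];
    int changed = 1;
    int s, a, s2, k;

    for (s = 0; s < NUM_STATES; s++)            /* random initial policy */
        policy[s] = rand() % NUM_ACTIONS;

    while (changed) {
        changed = 0;

        /* Approximate policy evaluation: K sweeps with the policy's action. */
        for (k = 0; k < K; k++) {
            for (s = 0; s < NUM_STATES; s++) {
                double expected = 0.0;
                for (s2 = 0; s2 < NUM_STATES; s2++)
                    expected += transition[s][policy[s]][s2] * U[s2];
                Unew[s] = reward[s] + gamma * expected;
            }
            memcpy(U, Unew, sizeof(U));          /* U <- U' */
        }

        /* Policy improvement: greedy action with respect to U. */
        for (s = 0; s < NUM_STATES; s++) {
            int best_a = policy[s];
            double best = -HUGE_VAL;
            for (a = 0; a < NUM_ACTIONS; a++) {
                double expected = 0.0;
                for (s2 = 0; s2 < NUM_STATES; s2++)
                    expected += transition[s][a][s2] * U[s2];
                if (expected > best) {
                    best = expected;
                    best_a = a;
                }
            }
            if (best_a != policy[s]) {
                policy[s] = best_a;
                changed = 1;
            }
        }
    }
}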
Getting Started
We have provided some C stub code as well as a Makefile; they are available at
/afs/ir.stanford.edu/class/cs121/hw6.
While it is your prerogative to do your development on any platform, it
is your responsibility to make sure that your solution works on a Sun Solaris machine (tree, saga, myth, elaine, fable, epic,
etc.), as we will be testing your programs on one of these.
You can compile the program and create the
executable by entering:
fable10:~> make
The name of the program is mdp.
The C starter code is available here:
/usr/class/cs121/hw6. The only file you should change is mdp.c.
Much of the coding has been done for you. You will need to add around 30 lines
of code to this file to implement the functions Value-Iteration and
Policy-Iteration.
Each state is represented by an integer from 0 to
10, numbered from left to right, bottom to top. The actions are represented by
integers 0 to 3, corresponding to up, down, left, and right respectively. The
reward function is represented by a 1-D array called reward, and the transition
model is represented by a 3-D array called transition. The values for these
arrays are read into memory by the starter code from the files mousemap.reward and mousemap.transition.
Please read the code in GetTransition and GetReward to see how these files work. Running the program
will print the results to the screen and create two files, value.out and policy.out,
containing the results.
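As a quick sanity check on your reading of these arrays, something like the following (assuming the transition[s][a][s′] index order described above; confirm the actual layout in GetTransition) verifies that the outgoing probabilities for each state-action pair sum to 1. The terminal states may be encoded specially in the provided file (for example, with no outgoing transitions), so do not be alarmed if they are reported.

#include <stdio.h>
#include <math.h>

/* Report any state-action pair whose outgoing probabilities do not sum to 1. */
void check_transition_model(void)
{
    int s, a, s2;
    for (s = 0; s < NUM_STATES; s++) {
        for (a = 0; a < NUM_ACTIONS; a++) {
            double total = 0.0;
            for (s2 = 0; s2 < NUM_STATES; s2++)
                total += transition[s][a][s2];
            if (fabs(total - 1.0) > 1e-6)
                printf("state %d, action %d: probabilities sum to %g\n",
                       s, a, total);
        }
    }
}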
You may wish to put a generously large limit on
the number of iterations in your program to avoid infinite loops due to
programming errors. A correctly
implemented program will not loop infinitely and will converge to a solution on
the provided transition and reward files.
Questions
Answer the following questions:
1. How does the discount factor affect the number of
iterations required for convergence in the two algorithms? For
Policy-Iteration, experiment with several random initial policies for each
discount and use the average number of iterations.
2. How does the optimal policy change when the
discount factor is increased from 0.9 to 1.0 in Value-Iteration?
3. Do an experiment where you change the reward
function (you will need to change the file mousemap.reward).
Describe your results with both algorithms.
4. For this part of the
problem, set γ = 0.9. Let p be the probability that an action has its intended effect.
The default mousemap.transition file has p = 0.8.
Run both algorithms while changing the value of p, but keep the
probability that the mouse slips in each perpendicular direction at (1-p)/2. You will
need to experiment with the mousemap.transition file
to find a value of p such that the optimal policy changes. What value of p did
you use? What are these new policies for each algorithm? Explain why changing p
changes the policy.
5. Discuss the differences between the results of
Policy-Iteration and Value-Iteration on the following: policies found and
utilities found.
Deliverables
Your program will be tested for correctness on
various maps. Correctness is worth 60 points. Correctness testing will be
done with scripts. Your program must compile and run on the specified grading
machines without us making any modifications.
If it does not, you will get no correctness points. 30 points will be
awarded for correctly answering the questions and 10 points will be awarded for
code quality. (Your code will need to be exceptionally badly designed to lose
these points, so don't stress too much about this.) In addition to the program code, you should
supply a README that includes your name, e-mail, SUNet
ID, answers to the questions, and any comments you might have.
To submit your assignment, clean out all object
files and executables from your directory, so that all you are submitting is
the Makefile, code, and README; then run the following submission script and
follow its instructions:
/afs/ir.stanford.edu/class/cs121/bin/submit6
Please only submit from inside your working
directory. Try not to submit sample input or output files.
A complete project will consist of the following:
Your answers to the questions.
Your source code (mdp.c), with all other files needed to correctly compile and execute your program.