%% Step 2: Generate the discrete state/action space MDP model
for a = Task.A % loop over the actions
    fprintf('Discrete system model for action a = %6.4f \n', U(a));
    for s = Task.S % loop over states
        % Bin center of the current discrete state
        pc = X(1,s); % position
        vc = X(2,s); % velocity
        for i = 1:Parameters.modeling_iter % loop over modeling iterations
            % Sample a continuous state uniformly within the bin around its
            % center (not around the previous successor state, which would
            % let the samples drift out of the bin)
            p0 = pc + (rand - 0.5)*delta_x; % position
            v0 = vc + (rand - 0.5)*delta_v; % velocity
            action = U(:,a); % input
            % Simulate for one time step. This function takes and returns
            % states expressed by their physical continuous values; the
            % included state_*2* functions convert between the continuous
            % and discrete representations.
            [p1,v1,r,~] = Mountain_Car_Single_Step(p0,v0,action); % isTerminalState is not needed here
            % Convert the start state (p0, v0) and the successor state
            % (p1, v1) to their discrete indices; si should equal s, since
            % the sample stays within the bin of s
            si = state_c2d([p0; v0]);
            sp = state_c2d([p1; v1]);
            % Update the model with this iteration's simulation result:
            % dividing by the total number of iterations (not by i) turns
            % the visit counts into probabilities and the summed rewards
            % into a sample mean
            Task.P_s_sp_a(si,sp,a) = Task.P_s_sp_a(si,sp,a) + 1/Parameters.modeling_iter;
            Task.R_s_a(si,a) = Task.R_s_a(si,a) + r/Parameters.modeling_iter;
        end % modeling_iter
    end % states
end % actions
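A quick sanity check of the resulting model (a sketch, assuming the Task structure built above and that every sample stays within its bin): since each of the Parameters.modeling_iter samples per state contributes exactly 1/Parameters.modeling_iter to one successor, every row of the transition tensor should sum to one.

% Sanity check: each row of P_s_sp_a must be a probability distribution
for a = Task.A
    rowSums = sum(Task.P_s_sp_a(:,:,a), 2); % sum over successor states sp
    assert(all(abs(rowSums - 1) < 1e-6), ...
        'Transition probabilities for action %d do not sum to 1.', a);
end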
Q1: The stochastic element is the discretization error, i.e. the difference between the actual continuous state and the nearest bin center. Since all continuous states within the same bin are represented by a single discrete state in the transition probability matrix, several state transitions from a given discrete state and action must be assigned nonzero probability, even though the underlying dynamics are deterministic. This effect becomes more pronounced as the number of bins is decreased.
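The stochasticity is visible directly in the learned model. The following sketch (the state and action indices s and a are arbitrary examples) prints the successor distribution of a single state-action pair; with truly deterministic transitions one would expect a single entry of probability one, but here several successors carry mass:

% Inspect the successor distribution of one (state, action) pair
s = 10; a = 1; % example indices, chosen arbitrarily
p = squeeze(Task.P_s_sp_a(s,:,a)); % distribution over successor states
successors = find(p > 0);
fprintf('From state %d with action %d:\n', s, a);
fprintf('  -> state %d with probability %5.3f\n', [successors; p(successors)]);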
Q2: The optimal solution is to apply a periodic control input whose frequency matches the resonance frequency of the system, which can be viewed as a driven oscillator: the underpowered car has to swing back and forth to build up enough energy to reach the goal. The algorithm is indeed able to find this solution.
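A simple hand-coded policy that produces this resonance behavior (an illustrative sketch, not the learned solution; the function name and signature are made up for this example) is to always push in the direction of the current velocity, which pumps energy into the oscillation:

% Resonance heuristic (sketch): accelerate with the current velocity
function u = resonance_policy(v, U)
    % v : current velocity, U : vector of admissible inputs
    if v >= 0
        u = max(U); % full thrust forward
    else
        u = min(U); % full thrust backward
    end
end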
Q3: With a deterministic discretized model, the transition probability matrix does not reflect reality: it treats certain state transitions as impossible although they can actually occur. One way to overcome this would be a very fine discretization, but the computational cost would then become very high. With the coarser binning used here, the deterministic model leaves the algorithm unable to find a solution.
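For comparison, the deterministic model would be built as below (a sketch reusing the names from Step 2): a single simulation from each bin center puts all probability mass on one successor state.

% Deterministic model variant (sketch): one simulation per bin center
for a = Task.A
    for s = Task.S
        [p1,v1,r,~] = Mountain_Car_Single_Step(X(1,s), X(2,s), U(:,a));
        sp = state_c2d([p1; v1]);
        Task.P_s_sp_a(s,sp,a) = 1; % all mass on a single successor
        Task.R_s_a(s,a) = r;
    end
end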
Q4: Epsilon trades off exploration against exploitation of the current policy. Increasing epsilon leads to much larger and more frequent variations of the accumulated reward across subsequent episodes. Furthermore, an overly large epsilon (say 0.45) makes it impossible to reach the goal, since the agent then acts randomly almost half of the time.
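For reference, the epsilon-greedy rule in question looks as follows (a sketch, assuming a |S|-by-|A| Q-table; the function name is made up for this example):

% Epsilon-greedy action selection (sketch)
function a = epsilon_greedy(Q, s, epsilon)
    nA = size(Q, 2); % number of actions
    if rand < epsilon
        a = randi(nA); % explore: uniformly random action
    else
        [~, a] = max(Q(s,:)); % exploit: greedy action
    end
end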
Q5: The final solution does not change.
Q6: The Q-Learning algorithm is able to find the shortest, and therefore optimal, path, and it even takes less time to do so than the Monte Carlo algorithm. The difference in speed is due to the fact that Q-Learning uses only local information (the immediate reward and the value of the next state) to update Q, whereas Monte Carlo has to “back-propagate” the rewards of a full episode. Monte Carlo also cannot find the optimal solution because it is an on-policy method: under its exploratory behavior policy, walking close to the cliff occasionally leads to falling off, so the risky path along the cliff edge appears unattractive and is difficult to learn.
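The local update in question is the standard Q-Learning rule (a sketch; alpha is the learning rate and gamma the discount factor):

% One Q-Learning update after observing (s, a, r, sp): no episode return
% is needed, only the immediate reward and the value of the next state
Q(s,a) = Q(s,a) + alpha * (r + gamma * max(Q(sp,:)) - Q(s,a));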