Bharat Tak
@devbharat
?
gonond
@gonond
ah nothing, just an annoying exercise
Bharat Tak
@devbharat
oh! I thought you had a eureka moment
gonond
@gonond
:) no...
Bharat Tak
@devbharat
regarding what you said about drifting
```matlab
% Convert the start state (p0, v0) and the successor state (p1, v1)
% to discrete state indices
s  = state_c2d([p0; v0]);
sp = state_c2d([p1; v1]);

% Update the model with the iteration's simulation results
% Count how many times sp is reached from s
Task.R_s_a(s,a) = Task.R_s_a(s,a) + r/i;
```
now I update s and sp!
this should be ok, right?
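As a sketch of the point under discussion (not from the provided files): adding `r/i` weights early iterations more heavily (r/1, r/2, ...), so if the intent is an equal-weight average of the reward over all modeling iterations, one of the following forms would be needed, assuming `Task.R_s_a` is zero-initialized:

```matlab
% Equal-weight average over Parameters.modeling_iter samples
% (sketch; assumes Task.R_s_a starts at zero):
Task.R_s_a(s,a) = Task.R_s_a(s,a) + r/Parameters.modeling_iter;
% or, equivalently, the incremental mean over iterations i = 1..N:
% Task.R_s_a(s,a) = Task.R_s_a(s,a) + (r - Task.R_s_a(s,a))/i;
```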
```matlab
%% Step 2: Generate the discrete state/action space MDP model
for a = Task.A   % loop over the actions
    fprintf('Discrete system model for action a = %6.4f \n', U(a));

    for s = Task.S  % loop over the states
        p1 = X(1,s);
        v1 = X(2,s);

        for i = 1:Parameters.modeling_iter % loop over modeling iterations
            % Perturb (p1, v1) by up to half a bin width in each dimension
            % (the commented-out alternative used a deterministic grid:
            %  (i-1)*delta_x/Parameters.modeling_iter - 0.5*delta_x)
            p0 = p1 + (rand - 0.5)*delta_x;   % position
            v0 = v1 + (rand - 0.5)*delta_v;   % velocity
            action = U(:,a);                  % input

            % Simulate for one time step. This function takes and returns
            % states expressed by their physical continuous values. You may
            % want to use the included state_*2* functions provided to do
            % this conversion. Note that p1 and v1 are overwritten here, so
            % the next iteration samples around the successor state -- the
            % "drift" discussed below.
            [p1,v1,r,isTerminalState] = Mountain_Car_Single_Step(p0,v0,action); % isTerminalState is not needed in this scope

            % Convert the start state (p0, v0) and the successor state
            % (p1, v1) to discrete state indices
            si = state_c2d([p0; v0]);
            sp = state_c2d([p1; v1]);

            % Update the model with the iteration's simulation results
            % Count how many times sp is reached from s

        end % modeling_iter
    end % states
end % actions
```
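A minimal sketch of the missing count-and-normalize step (the field name `Task.P_s_sp_a` is an assumption, not shown in the pasted code): count each observed transition inside the iteration loop, then turn the counts into probabilities once the loop over modeling iterations has finished.

```matlab
% Inside the modeling_iter loop (sketch; Task.P_s_sp_a is a
% hypothetical count matrix, zero-initialized before the loops):
Task.P_s_sp_a(si,sp,a) = Task.P_s_sp_a(si,sp,a) + 1;

% After the modeling_iter loop: normalize so each row of counts
% becomes a probability distribution over successor states sp.
Task.P_s_sp_a(s,:,a) = Task.P_s_sp_a(s,:,a) / sum(Task.P_s_sp_a(s,:,a));
```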
gonond
@gonond
I meant you drift away because you don't consider s as the current state anymore when you use the output of the simulation sp as the next s
Bharat Tak
@devbharat
yes, but I update the matrix correctly
gonond
@gonond
ah i see
Bharat Tak
@devbharat
you think it makes sense?
gonond
@gonond
my program just finished execution; the result goes in the right direction, it just doesn't bounce as high as before anymore, but it doesn't reach the right end. probably I have to decrease the penalty
Bharat Tak
@devbharat
yes
that might work
gonond
@gonond
I mean your idea is worth a try as well
just started it again with penalty r=-5
Bharat Tak
@devbharat
it's running
gonond
@gonond
ok
still too large a penalty, I'll try r=-2
Bharat Tak
@devbharat
ohk
gonond
@gonond
yeah, it worked!
Bharat Tak
@devbharat
great! mine is hardly moving...
gonond
@gonond
ok, I guess we are done with the coding part
Bharat Tak
@devbharat
yaa
gonond
@gonond
well the provided solution is still better, it is faster
but I think we can leave it like this
Bharat Tak
@devbharat
can you mail me the code as a zip? maybe my other files are not correct... I still don't see why it is only slightly oscillating with my implementation
gonond
@gonond
ok
Bharat Tak
@devbharat
thanks!
gonond
@gonond
so just the 2 design files, right?
Bharat Tak
@devbharat
yes
gonond
@gonond
there you go
Bharat Tak
@devbharat
something seems off, it is oscillating very slowly, are you sure you mailed the correct file?
gonond
@gonond
Maybe the parameters in your main are not the same as mine... I used 20x20 state bins, 5 action bins, and 20 modeling iterations
the files should be the correct ones
Bharat Tak
@devbharat
ah yes
it gets very close,
almost
gonond
@gonond
doesn't it reach the goal?
Bharat Tak
@devbharat
it did not in that try, but it was very close,
there is some randomness, so maybe we are at an edge case
gonond
@gonond
ok, funny... probably some fine-tuning is necessary
gonond
@gonond
hey, I would like to submit soon. I will post my answers so we can discuss them, ok?

Q1: The stochastic element is the difference between the actual continuous state and the nearest bin center, i.e. the discretization error. Since in the transition probability matrix all states within the same bin are represented by the same discrete state, one has to assign a certain probability to several different state transitions from a given discrete state and action, even though the underlying dynamics are deterministic.
This becomes even more important as the number of bins is decreased.
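As a small illustration of this effect, using the names from the modeling code above (not part of the submitted answer): two start states sampled from the bin of the same discrete state s can land in different successor bins under the same action.

```matlab
% Sketch: two in-bin samples from discrete state s under action a
% may reach different successor bins sp (names as in the code above).
p1 = X(1,s);  v1 = X(2,s);  action = U(:,a);
for k = 1:2
    p0 = p1 + (rand - 0.5)*delta_x;   % in-bin position sample
    v0 = v1 + (rand - 0.5)*delta_v;   % in-bin velocity sample
    [pn, vn, ~, ~] = Mountain_Car_Single_Step(p0, v0, action);
    sp = state_c2d([pn; vn]);         % successor bin; may differ per sample
    fprintf('sample %d -> successor bin %d\n', k, sp);
end
```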

Q2: The optimal solution is to apply a periodic control input whose frequency matches the resonance frequency of the system, which behaves like a driven oscillator. The algorithm is indeed able to find this solution.

Q3: When using the deterministic discretized model, the probability matrix does not reflect reality in the sense that it treats certain state transitions as impossible while they actually are possible. A way to overcome this problem could be a very fine discretization, but then the computational cost would become very high; with fewer bins, however, the deterministic model leaves the algorithm unable to find a solution.

Q4: Epsilon trades off exploration vs. exploitation of the current policy. Increasing epsilon leads to much larger and more frequent variations of the accumulated reward obtained in subsequent episodes. Furthermore, too large an epsilon (say 0.45) makes it impossible to reach the goal.
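For reference, a minimal epsilon-greedy selection sketch (generic textbook form; `Q`, `s`, and `epsilon` are hypothetical names for this illustration, not taken from the provided files):

```matlab
% Epsilon-greedy action selection over a tabular Q (sketch):
% explore with probability epsilon, otherwise exploit.
if rand < epsilon
    a = Task.A(randi(numel(Task.A)));   % explore: uniformly random action
else
    [~, a] = max(Q(s,:));               % exploit: greedy w.r.t. current Q
end
```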

Q5: The final solution does not change.

Q6: The Q-Learning algorithm is able to find the shortest and therefore optimal path, and it even takes less time to do so than the Monte Carlo algorithm. The difference in speed is due to the fact that Q-Learning uses only local information to update Q, whereas in Monte Carlo the rewards of a full episode have to be "back-propagated". Monte Carlo also cannot find the optimal solution because it is an on-policy method, and walking close to the cliff is risky, which makes that path difficult for it to learn.
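The "local information" referred to here is the standard one-step Q-Learning update; as a sketch (textbook form with hypothetical variable names, not the provided solution):

```matlab
% One-step Q-Learning update: only the observed transition
% (s, a, r, sp) is needed, no full-episode return.
Q(s,a) = Q(s,a) + alpha * (r + gamma * max(Q(sp,:)) - Q(s,a));
```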

Bharat Tak
@devbharat
Yes
I am still running modified versions of your code :P
Q1: which parameters have the most effect?