Q1: The stochastic element is the discretization error, i.e. the difference between the actual state and the nearest bin center. In the transition probability matrix, all states within the same bin are represented by the same discrete state. A given discrete state and action must therefore be assigned probabilities for several different state transitions, even though the underlying dynamics are deterministic.
This becomes even more important when fewer bins are used, because each bin then covers a larger range of actual states.
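The effect described in Q1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the code used in the assignment: the one-dimensional toy dynamics `toy_step`, the helper `estimate_transition_matrix`, and the choice to estimate probabilities by sampling states uniformly inside each bin.

```python
import numpy as np

def estimate_transition_matrix(step, n_bins, low, high, samples_per_bin=200, seed=0):
    """Estimate P[i, j] = Pr(next bin = j | current bin = i) for deterministic dynamics."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(low, high, n_bins + 1)
    P = np.zeros((n_bins, n_bins))
    for i in range(n_bins):
        # Many different continuous states share the same discrete state i.
        x = rng.uniform(edges[i], edges[i + 1], size=samples_per_bin)
        x_next = np.clip(step(x), low, high)
        # np.digitize is 1-based for values inside the range; shift to 0-based bins.
        j = np.clip(np.digitize(x_next, edges) - 1, 0, n_bins - 1)
        P[i] = np.bincount(j, minlength=n_bins) / samples_per_bin
    return P

def toy_step(x):
    # Hypothetical deterministic dynamics: shift the state by half a bin width.
    return x + 0.1

P = estimate_transition_matrix(toy_step, n_bins=5, low=0.0, high=1.0)
print(np.round(P, 2))
```

Although `toy_step` is deterministic, most rows of `P` contain more than one nonzero entry, because states from the same bin end up in different successor bins.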
Q2: The optimal solution is to apply a periodic control input at the resonance frequency of the system, which can be viewed as a driven oscillator. The algorithm is indeed able to find this solution.
Q3: With the deterministic discretized model, the probability matrix does not reflect reality: it treats certain state transitions as impossible although they are in fact possible. One way to overcome this problem is to use a very fine discretization, but the computational cost then becomes very high. With fewer bins, however, the algorithm is unable to find a solution.
Q4: Epsilon trades off exploration against exploitation of the current policy. Increasing epsilon leads to much larger and more frequent variations in the accumulated reward obtained in subsequent episodes. Furthermore, too large an epsilon (say 0.45) makes it impossible to reach the goal.
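The role of epsilon can be sketched with a standard epsilon-greedy action selection. The function name `epsilon_greedy` and the example Q-values are hypothetical and only serve as an illustration; the assignment's own implementation may differ.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action (explore), else the greedy one (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])  # hypothetical action values for one state

# Fraction of non-greedy actions for a small and a large epsilon.
for eps in (0.05, 0.45):
    actions = [epsilon_greedy(q_values, eps, rng) for _ in range(10_000)]
    print(eps, np.mean(np.array(actions) != 1))
```

A larger epsilon produces more non-greedy actions per episode, which is consistent with the larger variation in accumulated reward noted above.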
Q5: The final solution does not change.
Q6: The Q-Learning algorithm is able to find the shortest and therefore optimal path, and it even does so in less time than the Monte Carlo algorithm. The difference in speed arises because Q-Learning uses only local information to update Q, whereas Monte Carlo has to "back-propagate" the rewards of a full episode. Monte Carlo also fails to find the optimal solution: it is an on-policy method, so it evaluates the exploring policy it actually follows, and since walking close to the cliff is risky under that policy, the path along the cliff is difficult to learn.
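The difference between the two kinds of update can be sketched as follows. This is a generic tabular illustration under assumed parameters (`alpha`, `gamma`, and a 2x2 Q-table), not the code from the assignment; the function names are hypothetical.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One-step update: uses only the current transition (local information)."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def monte_carlo_returns(rewards, gamma=0.9):
    """Discounted return for every step; needs the rewards of the full episode."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=-1.0, s_next=1)  # can be applied after every single step
print(Q[0, 1])

print(monte_carlo_returns([-1, -1, -1], gamma=1.0))  # only available once the episode ends
```

Q-Learning can update after each step and bootstraps from the greedy value of the next state, whereas the Monte Carlo returns are computed backwards from the end of the episode.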