
An Analysis of 18 Iterations


It was seen that the best policy converged in 18 iterations with an average reward of -19 points.

2. Stage1 [Q-learning]: Experiment results

The RL agent, whose only motive was to maximize its reward, exhibited different behavior when the rewarding process and the learning hyperparameters were changed. Although a variety of experiments were conducted to arrive at an appropriate training mechanism, only a summary of the agent's performance in those experiments is listed below. It may be noted that seeding was used in the program to generate consistent results.

i. Experiments with learning rate

Learning rate and speed of convergence: an inverse relationship was found between the learning rate and the time to convergence. The higher the learning rate, the quicker the convergence.
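A minimal sketch of the tabular Q-learning update that these experiments tune is shown below; `alpha` is the learning rate and `gamma` the discount rate discussed in this section, while the `env` interface (`reset`, `step`, `actions`), the default values, and all names are illustrative assumptions rather than the report's actual setup.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; alpha is the learning rate, gamma the discount rate.

    Assumes a hypothetical env with reset(), step(action) -> (next_state,
    reward, done), and a list of discrete actions in env.actions.
    """
    Q = defaultdict(float)                      # Q[(state, action)] -> value estimate
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # TD update: alpha scales how far each estimate moves per step,
            # gamma scales how much future return is worth now.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

With this update, a larger `alpha` moves each Q-value estimate further per step, which is the mechanism behind the quicker convergence observed in the learning-rate experiments; a larger `gamma` weights future returns more heavily, matching the discount-rate observations later in this section.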

Figure 43: Q-learning experiments with different policies: (a) varied penalty for convergence duration; (b) varied penalty for revisiting the same cell

iv. Experiments with discount rate

Convergence was quicker with higher discount rates, thanks to the higher valuation of the long-term returns available to the agent. When the discount rate γ was lower, the agent did not have much visibility of future benefits; it was myopic and had to take decisions based on the immediate reward.

Figure 44: Q-learning experiments with different discount rates

3. Stage2 [Q-network]: Q-function approximation

Function approximation with a neural network was successfully implemented for the path-finding problem. The table below summarizes the observations from comparing it to Q-learning.

Figure 45: Comparison of experiment results between Q-network and Q-learning

The Stage 2 experiments were broadly similar to the Stage 1 experiments, as all tests were done by varying the hyperparameters. The results did not differ much; the noticeable difference was the reduction of loss and the quicker convergence (highlighted below) as the model was retrained over successive epochs (see the sketch at the end of this section). Both models were ultimately successful in identifying the optimal policy, even though the Q-network took a little longer.

Figure 46: Loss reduction with Q-function training over various epochs

Elaboration of all the experiments is not provided, as a flavor of these experiments has already been conveyed as part of the Stage 1 experiments, and the only …
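As a companion to the Q-network description above, here is a minimal sketch, assuming a small one-hidden-layer NumPy network trained by semi-gradient Q-learning over repeated epochs while the mean squared TD loss is tracked; the layer sizes, learning rate, and the (state, action, reward, next_state, done) batch format are illustrative assumptions, not the report's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded RNG (the report notes seeding was used for consistency)

def init_qnet(n_states, n_actions, hidden=32):
    """One-hidden-layer network mapping a one-hot state vector to Q-values per action."""
    return {
        "W1": rng.normal(0, 0.1, (n_states, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def forward(net, s_onehot):
    """Return the hidden activation and the Q-values for all actions."""
    h = np.maximum(0, s_onehot @ net["W1"] + net["b1"])    # ReLU hidden layer
    return h, h @ net["W2"] + net["b2"]

def train_epoch(net, batch, gamma=0.9, lr=0.01):
    """One epoch of semi-gradient Q-learning over (s, a, r, s2, done) samples.

    Returns the mean squared TD error, so repeated epochs expose the loss curve.
    """
    total_loss = 0.0
    for s, a, r, s2, done in batch:
        h, q = forward(net, s)
        _, q2 = forward(net, s2)
        target = r if done else r + gamma * np.max(q2)      # bootstrapped TD target
        td_error = q[a] - target
        total_loss += td_error ** 2
        # Gradient of 0.5 * td_error^2 w.r.t. the output layer, then backprop.
        grad_out = np.zeros_like(q)
        grad_out[a] = td_error
        net["W2"] -= lr * np.outer(h, grad_out)
        net["b2"] -= lr * grad_out
        grad_h = (net["W2"] @ grad_out) * (h > 0)            # ReLU derivative
        net["W1"] -= lr * np.outer(s, grad_h)
        net["b1"] -= lr * grad_h
    return total_loss / len(batch)
```

Calling `train_epoch` repeatedly on the same batch of replayed transitions would show the kind of epoch-over-epoch loss reduction referred to around Figure 46, with the greedy policy read off the network's Q-values once training stabilizes.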
