
Title: Automation of noise sampling in deep reinforcement learning

Authors: Kunal Karda; Namit Dubey; Abhas Kanungo; Varun Gupta

Addresses: Department of Computer Science and Engineering, Acropolis Technical Campus, Indore, India; Department of Computer Science and Engineering, Acropolis Technical Campus, Indore, India; Department of Electronics and Instrumentation Engineering, KIET Group of Institutions, Delhi – NCR, Ghaziabad – 201206, UP, India; Department of Electronics and Instrumentation Engineering, KIET Group of Institutions, Delhi – NCR, Ghaziabad – 201206, UP, India

Abstract: Actor-critic models are generally prone to overestimating Q-values and to settling on sub-optimal policies. Our proposed approach builds on a deep reinforcement learning algorithm known as the twin delayed deep deterministic policy gradient algorithm, or TD3. The approach is used to solve complex reinforcement learning problems in which a half-humanoid robot, an ant, or a half-cheetah must cover a path. Such problems can only be solved by an algorithm that works on continuous action spaces without much delay in propagating results during model inference. The proposed model has been adapted to converge faster to optimal Q-values. TD3 uses two deep neural networks to learn two Q-values, viz., Q1 and Q2; in the proposed approach, the average of these Q-values is taken as the input for the final Q-value, unlike other reinforcement learning algorithms such as DDPG, which are prone to overestimating Q-values. The proposed approach also introduces a self-adjusting noise clipping function, which makes it harder for the policy to exploit Q-function errors and further improves performance.
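
The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `smoothed_target_action` and `critic_target`, the parameter names, and the default discount factor are assumptions made here for illustration. The abstract does not state the rule by which the noise clip adjusts itself, so `noise_clip` is simply passed in as a parameter. The `"min"` option is included for comparison, since the original TD3 algorithm combines the two critics by taking their minimum.

```python
import numpy as np


def smoothed_target_action(target_action, noise_std, noise_clip, max_action, rng):
    """Add clipped Gaussian noise to the target action (target policy smoothing).

    noise_clip is a plain parameter here; the paper's self-adjusting rule
    is not described in the abstract and is not reproduced.
    """
    noise = rng.normal(0.0, noise_std, size=target_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    # Keep the perturbed action inside the valid action range.
    return np.clip(target_action + noise, -max_action, max_action)


def critic_target(reward, done, q1_next, q2_next, gamma=0.99, combine="mean"):
    """Bootstrapped critic target y = r + gamma * (1 - done) * Q_next.

    combine="mean" averages the two critics, as the abstract describes;
    combine="min" takes their minimum, as in the original TD3 algorithm.
    """
    if combine == "mean":
        q_next = 0.5 * (q1_next + q2_next)
    elif combine == "min":
        q_next = np.minimum(q1_next, q2_next)
    else:
        raise ValueError("combine must be 'mean' or 'min'")
    return reward + gamma * (1.0 - done) * q_next


# Hypothetical batch of two transitions; the second one is terminal.
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q1_next = np.array([10.0, 4.0])
q2_next = np.array([8.0, 6.0])

y_mean = critic_target(reward, done, q1_next, q2_next, combine="mean")
y_min = critic_target(reward, done, q1_next, q2_next, combine="min")

rng = np.random.default_rng(0)
noisy = smoothed_target_action(np.zeros((1000, 3)), 0.2, 0.5, 1.0, rng)
```

For the non-terminal transition, the averaged target is never lower than the minimum-based one, so the choice between the two trades off how conservative the value estimate is.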

Keywords: TD3; Q-values; deep neural networks; half-humanoid robot; ant; half-cheetah; reinforcement learning.

DOI: 10.1504/IJAPR.2022.122261

International Journal of Applied Pattern Recognition, 2022 Vol.7 No.1, pp.15 - 23

Received: 28 Sep 2020
Accepted: 02 Jul 2021
Published online: 14 Apr 2022
