Title: A novel double-mGBDT-based Q-learning

Authors: Qiming Fu; Shuai Ma; Dawei Tian; JianPing Chen; Zhen Gao; Shan Zhong

Addresses: School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China; Jiangsu Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China; Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China | School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China; Jiangsu Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China; Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China | School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China; Jiangsu Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China; Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China | Jiangsu Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China; Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou, Jiangsu, 215009, China | Faculty of Engineering, McMaster University, Hamilton, L8S 0A3, Canada | Institute of Computer Science, Changshu Institute of Technology, Changshu, Jiangsu, 215500, China

Abstract: This paper proposes a novel double-mGBDT-based Q-learning algorithm. In contrast to traditional deep reinforcement learning, the proposed algorithm replaces the DNN with the mGBDT, which is introduced as the function approximator. In the learning process, based on the current state, we use the Bellman equation to construct the target value, which is then used to train the mGBDT in an online manner. As in DQN, we adopt two mGBDT frameworks (an online one and a target one) to address the problem of easy divergence. To verify performance, we apply the proposed algorithm, DQN, and mGBDT to two traditional benchmark problems, CartPole and MountainCar. The results show that the proposed algorithm converges to the optimal policy and, compared with DQN, exhibits much better stability after convergence.
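The abstract describes the standard double-approximator pattern: a frozen target model builds the Bellman target, the online model is regressed toward it, and the two are periodically synchronised. Below is a minimal sketch of that update rule. It is an illustration only, not the authors' code: a tiny hand-rolled linear regressor (`LinearQ`, a hypothetical stand-in) replaces the mGBDT, and all names and parameters are assumptions.

```python
import numpy as np

class LinearQ:
    """Hypothetical stand-in for the mGBDT approximator: Q(s, a) = w[a] . s."""
    def __init__(self, n_features, n_actions):
        self.w = np.zeros((n_actions, n_features))

    def predict(self, s):
        # Q-values for every action in state s
        return self.w @ s

    def fit_step(self, s, a, target, lr=0.1):
        # One online regression step toward the Bellman target
        self.w[a] += lr * (target - self.w[a] @ s) * s

def q_update(online, frozen, s, a, r, s_next, done, gamma=0.99):
    """Build the Bellman target with the frozen (target) approximator,
    then train the online approximator toward it, DQN-style."""
    target = r if done else r + gamma * np.max(frozen.predict(s_next))
    online.fit_step(s, a, target)
    return target

# Usage: run updates with the online model, and every C steps copy its
# parameters into the frozen model -- the mechanism credited with
# addressing "easy divergence".
online = LinearQ(n_features=4, n_actions=2)
frozen = LinearQ(n_features=4, n_actions=2)
s, s_next = np.ones(4), np.zeros(4)
t = q_update(online, frozen, s, a=0, r=1.0, s_next=s_next, done=False)
frozen.w = online.w.copy()  # periodic sync step
```

In the paper's setting the regression step would instead be a (mini-batch) fit of the mGBDT on the constructed targets; the control flow around it is the same.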

Keywords: deep learning; reinforcement learning; mGBDT.

DOI: 10.1504/IJMIC.2021.121827

International Journal of Modelling, Identification and Control, 2021 Vol.37 No.3/4, pp.232 - 239

Received: 27 Oct 2020
Accepted: 06 Jan 2021

Published online: 07 Apr 2022
