

Department of Automation, Tsinghua University
Published:2023
[1] LAO Tiancheng, LIU Yi, FAN Wenhui. Research and improvement of the multi-agent deep deterministic policy gradient algorithm[J]. Journal of Xinjiang University (Natural Science Edition in Chinese and English), 2023, 40(06): 717-723. DOI: 10.13568/j.cnki.651094.651316.2023.05.08.0001.
The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm may fail to learn the optimal policy in some scenarios, especially in partially observable environments and under sparse rewards. To address this problem, this paper improves MADDPG in two ways. First, it adopts observation stacking and adds Long Short-Term Memory (LSTM) layers to the deep networks; the effectiveness of the improved algorithm in agent decision-making is verified on a predator-prey scenario with a sheltered area. Second, it introduces Hindsight Experience Replay (HER); comparative experiments on the cooperative communication and cooperative navigation scenarios show that the improved algorithm substantially increases the amount of high-value experience the agents obtain, which speeds up the convergence of MADDPG and helps the agents learn the optimal policy.
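The two ideas named in the abstract, observation stacking and hindsight relabeling, can be illustrated with a minimal sketch. This is not the paper's implementation, which is not shown on this page. The class `ObservationStacker`, the functions `sparse_reward` and `her_relabel`, the 0/-1 reward, the tolerance `tol`, and the transition dictionary layout are all hypothetical choices made for illustration. The relabeling uses the "future" goal-sampling strategy from the original HER paper, applied to one agent's transitions; how the paper adapts HER to the multi-agent setting is not covered.

```python
import random
from collections import deque


class ObservationStacker:
    """Keeps the last k observations and returns them concatenated.

    Stacking gives a feed-forward policy a short history, which can
    partly compensate for partial observability (assumed setup).
    """

    def __init__(self, k, obs_dim):
        # Start with k zero-filled frames so the output size is constant.
        self.frames = deque([[0.0] * obs_dim for _ in range(k)], maxlen=k)

    def push(self, obs):
        # Appending drops the oldest frame once the deque is full.
        self.frames.append(list(obs))
        # Oldest frame first, newest frame last.
        return [x for frame in self.frames for x in frame]


def sparse_reward(achieved, goal, tol=0.05):
    """Hypothetical sparse reward: 0 when the goal is reached, else -1."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def her_relabel(episode, k_future=4, rng=random):
    """Relabel an episode with goals achieved later in the same episode.

    episode: list of dicts with keys "obs", "action", "achieved", "goal",
             where "achieved" is the state reached after the step.
    Returns the original transitions plus k_future relabeled copies each.
    """
    transitions = []
    for t, step in enumerate(episode):
        # Original transition, rewarded against the original goal.
        reward = sparse_reward(step["achieved"], step["goal"])
        transitions.append({**step, "reward": reward})
        for _ in range(k_future):
            # "future" strategy: pick a state reached at step t or later.
            future = rng.randint(t, len(episode) - 1)
            new_goal = episode[future]["achieved"]
            new_reward = sparse_reward(step["achieved"], new_goal)
            transitions.append({**step, "goal": new_goal, "reward": new_reward})
    return transitions
```

Under these assumptions, an episode that never reaches its original goal still yields some transitions with reward 0 after relabeling, which is the sense in which HER increases the share of high-value experience in the replay buffer.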