Research and Improvement of the Multi-Agent Deep Deterministic Policy Gradient Algorithm

LAO Tiancheng; LIU Yi; FAN Wenhui

doi:10.13568/j.cnki.651094.651316.2023.05.08.0001

Updated: 2026-01-21

- Journal of Xinjiang University (Natural Science Edition in Chinese and English), Vol. 40, Issue 6, Pages: 717-723 (2023)
- Author affiliation: Department of Automation, Tsinghua University
- CLC: TP18
- Published: 2023
[1] LAO Tiancheng, LIU Yi, FAN Wenhui. Research and improvement of the multi-agent deep deterministic policy gradient algorithm[J]. Journal of Xinjiang University (Natural Science Edition in Chinese and English), 2023, 40(06): 717-723. DOI: 10.13568/j.cnki.651094.651316.2023.05.08.0001.

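Abstract (translated from the Chinese original)

The multi-agent deep deterministic policy gradient (MADDPG) algorithm does not always learn the optimal policy in some scenarios, particularly in partially observable environments and under sparse rewards. To address this, MADDPG is improved by stacking observations and by adding a Long Short-Term Memory (LSTM) layer to the deep networks; a predator-prey scenario with a sheltered area verifies the effectiveness of the improved algorithm in agent decision-making. Hindsight Experience Replay (HER) is also introduced into MADDPG; comparative experiments on the cooperative communication and cooperative navigation scenarios show that the improved algorithm greatly increases the amount of high-value experience the agents obtain, speeds up the convergence of MADDPG, and helps the agents learn the optimal policy.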
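The observation stacking idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ObservationStacker`, the window length `k`, and the zero padding at episode start are all assumptions made for illustration. Each agent keeps its `k` most recent local observations and feeds their concatenation to its networks, which gives a feed-forward policy a short memory under partial observability.

```python
from collections import deque
from typing import Deque, List, Sequence


class ObservationStacker:
    """Keeps the k most recent observations of one agent (hypothetical helper)."""

    def __init__(self, k: int, obs_dim: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.obs_dim = obs_dim
        self.frames: Deque[List[float]] = deque(maxlen=k)
        self.reset()

    def reset(self) -> None:
        # Assumption: pad with zero vectors at the start of an episode.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append([0.0] * self.obs_dim)

    def push(self, obs: Sequence[float]) -> List[float]:
        """Adds a new observation and returns the stacked input, oldest first."""
        if len(obs) != self.obs_dim:
            raise ValueError("observation has the wrong dimension")
        self.frames.append(list(obs))  # deque(maxlen=k) drops the oldest frame
        return [x for frame in self.frames for x in frame]


# Usage: a window of 3 two-dimensional observations.
stacker = ObservationStacker(k=3, obs_dim=2)
stacker.push([1.0, 2.0])
stacked = stacker.push([3.0, 4.0])
```

In a multi-agent setting one such stacker would be kept per agent and reset at every episode boundary; the stacked vector has length `k * obs_dim`, so the input layer of the actor must be sized accordingly.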

Abstract

To address the problem that the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm may fail to learn the optimal policy in some scenarios, especially in partially observable environments and under sparse rewards, this paper adopts the observation stacking method and introduces LSTM layers into the deep networks to improve the algorithm. The effectiveness of the improved algorithm in decision-making is verified on a predator-prey scenario with a sheltered area. In addition, the HER method is introduced to improve the algorithm. Experiments on the cooperative communication scenario and the cooperative navigation scenario show that the improved algorithm enables the agents to obtain more high-value experience, which speeds up convergence and makes it easier to learn the optimal policy.
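How HER turns failed episodes into useful experience under sparse rewards can be shown with a small sketch. It uses the common "final" relabeling strategy, which may differ from the strategy used in the paper. The `Transition` fields, the `sparse_reward` function, and its tolerance `tol` are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Vec = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    obs: Vec
    action: Vec
    next_obs: Vec
    achieved_goal: Vec  # what the agent actually reached after this step
    goal: Vec           # what the agent was trying to reach
    reward: float


def sparse_reward(achieved: Vec, goal: Vec, tol: float = 0.05) -> float:
    """Assumed sparse reward: 0 when the goal is reached, -1 otherwise."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def relabel_final(
    episode: List[Transition],
    reward_fn: Callable[[Vec, Vec], float] = sparse_reward,
) -> List[Transition]:
    """Replays the episode as if its final achieved state had been the goal."""
    if not episode:
        return []
    new_goal = episode[-1].achieved_goal
    return [
        replace(t, goal=new_goal, reward=reward_fn(t.achieved_goal, new_goal))
        for t in episode
    ]


# Usage: an episode that never reaches the original goal (1.0, 1.0).
episode = [
    Transition((0.0, 0.0), (1.0, 0.0), (0.2, 0.0), (0.2, 0.0), (1.0, 1.0), -1.0),
    Transition((0.2, 0.0), (1.0, 0.0), (0.5, 0.0), (0.5, 0.0), (1.0, 1.0), -1.0),
]
relabeled = relabel_final(episode)
```

Both the original and the relabeled transitions would be stored in the replay buffer. After relabeling, the last transition carries a reward of 0 instead of -1, so the buffer contains at least one successful example per episode, which matches the abstract's point that HER increases the amount of high-value experience.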

