# Collaborative Cost Multi-Agent Decision-Making Algorithm with Factored-Value Monte Carlo Tree Search and Max-Plus


## Abstract


## 1. Introduction

- (1) Depending on the circumstance (state), different actions have different levels of relevance. Based on the current scenario, local-level decision-making orders the actions from high to low probability of effectiveness. Ordering the actions this way increases the likelihood that the best solution is discovered early, which is highly valuable for an anytime method.
- (2) In a dynamic context, network segmentation is unavoidable, yet network connectivity is necessary for global-level optimization in multi-agent environments. In some circumstances (such as cyberattacks or network failures), this communication cannot always be guaranteed. In such hostile scenarios, the algorithm may be unable to reach the global optimal solution, so the local optimal solution from the first stage may be the best available option.

- (a) In centralized arrangements, the central controller observes all agents jointly and makes joint choices for all of them; each agent acts on the central controller's decision. Failure or malfunction of the central controller is therefore equivalent to failure of the entire MAS.
- (b) To exchange information, the central controller must communicate with every agent, concentrating the communication overhead at a single controller. This can reduce the scalability and robustness of the MAS.
- (c) In a centralized setup (centralized controller), agents are not permitted to exchange information about state transitions or rewards with one another. A hybrid MAS coordination strategy could allow each agent to interact to make local and correlated decisions.

- We consider a budget-constraint approach in which each action is assigned a cost. Different actions consume varying quantities of resources, which may be correlated to the team's global payoff [11]. In such a scenario, the goal of the local decision maker is to optimize its decision under cost and budget constraints at any given time. Consequently, the global team reward at each time step is calculated by subtracting the total cost incurred by the local actions. In this manner, we extend previous works [1,2,3] on centralized coordination, where the only budgetary constraint was time.
- We devise a hybrid (i.e., centralized and distributed) coordination of the Max-Plus algorithm [3,4] in which each agent computes and sends updated messages after receiving new and distinct messages from a neighbor. Messages are sent in parallel, which provides computational advantages over the sequential execution of previous centralized coordination algorithms [1,2].
- We developed a new FV-MCTS-Hybrid Cost-Max-Plus decision-making method with two stages. Our contribution is a theoretical framework for integrating Monte Carlo Tree Search (MCTS) with the Cost Hybrid Max-Plus algorithm for decision making and execution. The proposed method is a suboptimal Dec-POMDP solution: even for two agents, solving a Dec-POMDP exactly is known to be intractable and NEXP-complete (non-deterministic exponential time) [12].
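As a minimal sketch of the budget-constrained reward described in the first bullet above (the function name and data layout are our own illustration, not the authors' code), the global team reward at a time step is the sum of local payoffs minus the total cost of the local actions, subject to the budget:

```python
# Hypothetical sketch of the budget-constrained global reward described above.
# Local payoffs and per-action costs are assumed given; names are illustrative.

def global_reward(local_payoffs, action_costs, budget):
    """Sum the local payoffs and subtract the action costs.

    Returns None when the joint action's total cost exceeds the remaining
    budget, i.e., the joint action is infeasible under the constraint."""
    total_cost = sum(action_costs)
    if total_cost > budget:
        return None  # joint action infeasible under the budget constraint
    return sum(local_payoffs) - total_cost

# Example: three agents with payoffs 5, 3, 4 and action costs 1, 2, 1
# under a budget of 6 resource units.
r = global_reward([5, 3, 4], [1, 2, 1], budget=6)  # 12 - 4 = 8
```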

## 2. Related Work

- (1) To advance agents to the following global state by performing a global simulation from the current global state, where the CG structure does not change over time.
- (2) To change each agent's joint action choice only in the last stage of the algorithm (after the centralized Max-Plus algorithm converges).

## 3. Mathematical Background

## 4. Both Variants of No Cost Max-Plus Algorithms

- A. Centralized Max-Plus Algorithm with no Cost
- B. Distributed Iterative Max-Plus Algorithm with no Cost

## 5. Distributed Max-Plus Algorithm with Cost

## 6. Hybrid Max-Plus Algorithm with Cost

- Actions ${a}_{i}\in {A}_{i}$ for any agent $i$.
- Initialization by the centralized coordinator: ${\mu}_{ij}={\mu}_{ji}=0$ for any $\left(i,j\right)\in E$, ${a}_{i}\in {A}_{i}$, ${a}_{j}\in {A}_{j}$; and, for any agent $i$, ${r}_{i}=0$ and the global reward $R\left(\mathit{a}\right)=0$.
- The costs of actions ${c}_{i}$ for any agent $i$ and the costs ${\mathcal{C}}_{i,j}$ for any pair of actions $\left({a}_{i},{a}_{j}\right)$.
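The inputs above can be held in simple data structures. The following sketch (our own notation; the agents, actions, and graph are hypothetical examples, not from the paper) initializes the messages, payoffs, and costs as the centralized coordinator would before Algorithm 1 runs:

```python
# Illustrative initialization of the hybrid Cost Max-Plus inputs listed above.
# The agent set, action sets, and coordination-graph edges are toy examples.

agents = [1, 2, 3]
actions = {i: ["a", "b"] for i in agents}       # A_i for each agent i
edges = [(1, 2), (2, 3)]                         # coordination-graph edges E

# mu[(i, j)][a_j] = 0 for every directed message (i -> j) and every
# action a_j of the receiving agent, as required by the initialization.
mu = {}
for (i, j) in edges:
    mu[(i, j)] = {a: 0.0 for a in actions[j]}
    mu[(j, i)] = {a: 0.0 for a in actions[i]}

r = {i: 0.0 for i in agents}                     # accumulated payoffs r_i = 0
c = {i: {a: 1.0 for a in actions[i]} for i in agents}  # action costs c_i
C = {e: 0.5 for e in edges}                      # pairwise costs C_{i,j}
```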

Algorithm 1 Wait for segmentation message $\mathit{s}$ as shown in Figure 8

1. IF an agent receives the segmentation message $\mathit{s}=1$, go to the Cost Centralized Max-Plus algorithm given below //all agents that receive $s=1$

2. WHILE the fixed point is not reached and the time and cost budgets are not exhausted //the centralized coordinator evaluates this condition

3. DO for any iteration $m$

4. FOR any agent $i$

5. FOR all neighbors $j\in \mathsf{\Gamma}(i)$

6. a. compute ${\mu}_{ij}({a}_{j})={Q}_{i}\left({a}_{i}\right)-{c}_{i}+{Q}_{ij}({a}_{i},{a}_{j})+{\sum}_{k\in \mathsf{\Gamma}\left(i\right)\backslash j}{\mu}_{ki}({a}_{i})-{\mathcal{C}}_{i,j}$

7. b. normalize the message ${\mu}_{ij}({a}_{j})$

8. c. send the message ${\mu}_{ij}({a}_{j})$ to agent $j$

9. d. check whether ${\mu}_{ij}({a}_{j})$ is close to the previous message (equivalent to reaching convergence)

10. END FOR all neighbors

11. Calculate by the centralized coordinator

${a}_{i}^{*}=\mathrm{arg}\underset{{a}_{i}}{\mathrm{max}}\left\{0,\text{}{Q}_{i}\left({a}_{i}\right)-c\left({a}_{i}\right)+{\sum}_{j\in \mathsf{\Gamma}\left(i\right)}{\mu}_{ji}({a}_{i})\right\}$

12. Determine ${\mathit{a}}^{*}$, the optimal global action so far, including all previous ${a}_{i}^{\prime}$

13. //Use anytime:

14. IF $R(\mathit{a}^{\prime})\ge r$ THEN ${a}_{i}^{*}={a}_{i}^{\prime}$; $r=R(\mathit{a}^{\prime})$

15. ELSE keep the previous ${a}_{i}^{*}$

16. END IF

17. END FOR every agent $i$

18. END DO for any iteration $m$

19. END WHILE

20. Return the global reward $R\left(\mathit{a}\right)$

21. ELSE //all agents that receive $s=0$

22. Build the spanning tree with the algorithm in [15] and go to the decentralized Max-Plus

WHILE the fixed point is not reached and the horizon time $T$ and cost budget are not exhausted //the root agent evaluates this condition

23. IF ${m}_{1}$ = (regular Max-Plus message given by (4))

24. FOR any iteration $l$, $1\le l\le M+2\left(N-1\right)$

25. FOR all neighbors $j\in \mathsf{\Gamma}(i)$

a. compute ${\mu}_{ij}({a}_{j})={Q}_{i}\left({a}_{i}\right)-{c}_{i}+{Q}_{ij}\left({a}_{i},{a}_{j}\right)+\sum _{k\in \mathsf{\Gamma}\left(i\right)\backslash j}{\mu}_{ki}({a}_{i})-{\mathcal{C}}_{i,j}$

b. normalize the message ${\mu}_{ij}({a}_{j})$ for convergence

c. send the message ${m}_{1}={\mu}_{ij}({a}_{j})$ to agent $j$ if it differs from the previous one

d. check whether ${\mu}_{ij}({a}_{j})$ is close to the previous message value

26. END FOR //all neighbors

27. Calculate with (6) the optimal individual action ${a}_{i}^{\prime}=\mathrm{arg}\underset{{a}_{i}}{\mathrm{max}}\left\{0,\text{}{Q}_{i}\left({a}_{i}\right)-c\left({a}_{i}\right)+\sum _{j\in \mathsf{\Gamma}\left(i\right)}{\mu}_{ji}({a}_{i})\right\}$

28. Determine ${\mathit{a}}^{\prime}$, the optimal global action so far, including all previous ${a}_{i}^{\prime}$

29. END FOR //all iterations

30. ELSE IF ${m}_{2}$ = (a request for payoff evaluation)

31. Lock ${a}_{i}^{\prime}$, set ${r}_{i}=0$, and send the evaluation request to all children

32. IF agent $i$ is a leaf, initiate the payoff accumulation

33. END IF

34. ELSE IF ${m}_{3}$ = (request to calculate the accumulated payoff ${r}_{i}$ for agent $i$)

35. ${r}_{i}={r}_{i}+{r}_{j}$ //add the payoff ${r}_{j}$ of each child

36. IF agent $i$ is the root, send the global payoff to all children

37. ELSE send the accumulated payoff to the parent

38. END IF

39. ELSE IF ${m}_{4}$ = (evaluate the global reward)

40. Calculate the evaluated global reward

41. //Use anytime:

42. IF $R\ge r$

43. ${a}_{i}^{*}={a}_{i}^{\prime}$ and $r=R({a}_{i}^{\prime})$

44. ELSE

45. keep the previous ${a}_{i}^{*}$

46. END IF

47. END IF //(message $s=0$)

48. END WHILE

49. Return the best joint actions ${\mathit{a}}^{*}$ and the global reward $R\left(\mathit{a}\right)$ accumulated so far

50. END IF //(segmentation message $s=1$ or $0$)
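To make the message update (step 6 / step 25a) and the individual action selection (step 11 / step 27) concrete, here is a compact Python sketch. It is a simplified illustration under our own assumptions, not the authors' code: in particular, following the standard Max-Plus formulation, we make the maximization over the sender's action $a_i$ explicit in the message, and we normalize by subtracting the mean. All names and data structures are hypothetical.

```python
# Sketch of one cost Max-Plus message update and the individual action
# selection. Q_i, Q_ij, the costs, and the graph are assumed toy inputs.

def update_message(i, j, actions, Q_i, Q_ij, c, C, mu, neighbors):
    """mu_ij(a_j) = max_{a_i} [Q_i(a_i) - c_i(a_i) + Q_ij(a_i, a_j)
                               + sum_{k in Γ(i)\{j}} mu_ki(a_i)] - C_ij,
    then normalized (mean subtracted) to help convergence."""
    msg = {}
    for a_j in actions[j]:
        msg[a_j] = max(
            Q_i[i][a_i] - c[i][a_i] + Q_ij[(i, j)][(a_i, a_j)]
            + sum(mu[(k, i)][a_i] for k in neighbors[i] if k != j)
            for a_i in actions[i]
        ) - C[(i, j)]
    mean = sum(msg.values()) / len(msg)
    return {a: v - mean for a, v in msg.items()}  # normalized message

def best_action(i, actions, Q_i, c, mu, neighbors):
    """a_i* = argmax_{a_i} max{0, Q_i(a_i) - c_i(a_i) + sum_j mu_ji(a_i)}."""
    def value(a_i):
        return max(0.0, Q_i[i][a_i] - c[i][a_i]
                   + sum(mu[(j, i)][a_i] for j in neighbors[i]))
    return max(actions[i], key=value)
```

In an actual run, each agent would apply `update_message` toward every neighbor per iteration and re-select its action with `best_action`, exactly as the loop structure of Algorithm 1 prescribes.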

## 7. Factored-Value MCTS Hybrid Cost Max-Plus Method

Nevertheless, progress towards the subsequent global state cannot be assured. Our proposed approach is capable of surmounting this challenge.

The segmentation message $\mathit{s}$ is received by either of them (red arrow). In this scenario, there is no communication between the two groups of the six agents. The results for the first three agents, Q-1, Q-2, and Q-3, which are configured in a spanning-tree configuration (distributed or decentralized), are plotted in Figure 10. The payoff value and the convergence performance of the second group of agents (centralized, in CG settings), Q-4, Q-5, and Q-6, are plotted in Figure 11. We maintained the same action costs.
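The segmentation scenario above can be sketched as a graph partition: once the link between the two groups is cut, each connected component of the coordination graph coordinates on its own. The following illustration (our own construction with hypothetical agent names, not the authors' code) recovers the two groups after segmentation:

```python
# Illustrative sketch of the segmentation scenario: six agents whose
# coordination graph is cut into two disconnected groups of three.

from collections import deque

def connected_groups(agents, edges):
    """Return the connected components of the coordination graph."""
    adj = {a: set() for a in agents}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, groups = set(), []
    for a in agents:
        if a in seen:
            continue
        comp, queue = set(), deque([a])
        while queue:               # breadth-first search from agent a
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(adj[u] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

# Six agents on a line; the segmentation message s cuts the Q3-Q4 link.
agents = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]
edges = [("Q1", "Q2"), ("Q2", "Q3"), ("Q4", "Q5"), ("Q5", "Q6")]
groups = connected_groups(agents, edges)  # two groups of three agents
```

One group would then run the distributed (spanning-tree) variant and the other the centralized variant, as in Figures 10 and 11.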

## 8. Conclusions and Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Choudhury, S.; Gupta, J.K.; Morales, P.; Kochenderfer, M.J. Scalable Anytime Planning for Multi-Agent MDPs. arXiv **2021**, arXiv:2101.04788. [Google Scholar]
- Amato, C.; Oliehoek, F. Scalable planning and learning for multi agent POMDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
- Cotae, P.; Kang, M.; Velazquez, A. A Scalable Real-Time Multiagent Decision Making Algorithm with Cost. In Proceedings of the 2021 IEEE Symposium on Computers and Communications (ISCC), Athens, Greece, 5–8 September 2021; pp. 1–6. [Google Scholar]
- Cotae, P.; Kang, M.; Velazquez, A. A Scalable Real-Time Distributed Multiagent Decision Making Algorithm with Cost. In Proceedings of the IEEE 19th Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 8–11 January 2022; pp. 745–746. [Google Scholar] [CrossRef]
- Kok, J.R.; Vlassis, N. Collaborative multi agent reinforcement learning by payoff propagation. J. Mach. Learn. Res. **2006**, 7, 1789–1828. [Google Scholar]
- Kok, J.R.; Vlassis, N. Using the max-plus algorithm for multi agent decision making in coordination graphs. In Robot Soccer World Cup; Springer: Berlin/Heidelberg, Germany, 2005; pp. 1–12. [Google Scholar]
- Vlassis, N.; Elhorst, R.; Kok, J.R. Anytime algorithms for multi agent decision making using coordination graphs. In Proceedings of the 2004 IEEE International Conference on Systems, Man and Cybernetics, The Hague, The Netherlands, 10–13 October 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 1, pp. 953–957. [Google Scholar]
- Best, G.; Cliff, O.M.; Patten, T.; Mettu, R.R.; Fitch, R. Dec-MCTS: Decentralized planning for multi-robot active perception. Int. J. Robot. Res. **2019**, 38, 316–337. [Google Scholar] [CrossRef]
- de Nijs, F.; Walraven, E.; De Weerdt, M.; Spaan, M. Constrained multi agent Markov decision processes: A taxonomy of problems and algorithms. J. Artif. Intell. Res. **2021**, 70, 955–1001. [Google Scholar] [CrossRef]
- Gupta, J.K. Modularity and Coordination for Planning and Reinforcement Learning. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2020. [Google Scholar]
- Patra, S.; Velazquez, A.; Kang, M.; Nau, D. Using online planning and acting to recover from cyberattacks on software-defined networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 15377–15384. [Google Scholar]
- Guestrin, C.; Lagoudakis, M.; Parr, R. Coordinated reinforcement learning. ICML **2002**, 2, 227–234. [Google Scholar]
- Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The complexity of decentralized control of Markov decision processes. Math. Oper. Res. **2002**, 27, 819–840. [Google Scholar] [CrossRef]
- Revach, G.; Greshler, N.; Shimkin, N. Planning for Cooperative Multiple Agents with Sparse Interaction Constraints. In Proceedings of the 6th Workshop on Distributed and Multi-Agent Planning (DMAP) at ICAPS 2020, Haifa, Israel, 2020. Available online: https://icaps20subpages.icaps-conference.org/wp-content/uploads/2020/11/The-online-Proceedings-of-the-6th-Workshop-on-Distributed-and-Multi-Agent-Planning-DMAP-at-ICAPS-2020.pdf (accessed on 15 November 2023).
- Pettie, S.; Ramachandran, V. An optimal minimum spanning tree algorithm. J. ACM **2002**, 49, 16–34. [Google Scholar] [CrossRef]
- Czech, J. Distributed Methods for Reinforcement Learning Survey. In Reinforcement Learning Algorithms: Analysis and Applications; Springer: Cham, Switzerland, 2021; pp. 151–161. [Google Scholar]
- Li, R.; Patra, S.; Nau, D.S. Decentralized Refinement Planning and Acting. In Proceedings of the International Conference on Automated Planning and Scheduling, Guangzhou, China, 2–13 August 2021; Volume 31, pp. 225–233. [Google Scholar]
- Hayes, C.F.; Reymond, M.; Roijers, D.M.; Howley, E.; Mannion, P. Risk Aware and Multi-Objective Decision Making with Distributional Monte Carlo Tree Search. arXiv **2021**, arXiv:2102.00966. [Google Scholar]
- Rossi, F.; Bandyopadhyay, S.; Wolf, M.T.; Pavone, M. Multi-Agent Algorithms for Collective Behavior: A structural and application-focused atlas. arXiv **2021**, arXiv:2103.11067. [Google Scholar]
- Grover, D.; Christos, D. Adaptive Belief Discretization for POMDP Planning. arXiv **2021**, arXiv:2104.07276. [Google Scholar]
- Guestrin, C.; Koller, D.; Parr, R. Multi agent Planning with Factored MDPs. Adv. Neural Inf. Process. Syst. **2001**, 14, 1523–1530. [Google Scholar]
- Landgren, P.; Srivastava, V.; Leonard, N.E. Distributed cooperative decision making in multi-agent multi-armed bandits. Automatica **2021**, 125, 109445. [Google Scholar] [CrossRef]
- Mahajan, A.; Samvelyan, M.; Mao, L.; Makoviychuk, V.; Garg, A.; Kossaifi, J.; Whiteson, S.; Zhu, Y.; Anandkumar, A. Reinforcement Learning in Factored Action Spaces using Tensor Decompositions. arXiv **2021**, arXiv:2110.14538. [Google Scholar]
- Cotae, P.; Reindorf, N.E.A.; Kang, M.; Velazquez, A. Work-in-Progress: A Hybrid Collaborative Multi Agent Decision Making Algorithm with Factored-Value Max-Plus. In Proceedings of the 2023 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Istanbul, Turkey, 4–7 July 2023; pp. 438–443. [Google Scholar] [CrossRef]

**Figure 2.** Interaction on an edge between two agents in the centralized Max-Plus algorithm. (**a**): Agent $i$ receives the payoffs from all its neighbors. (**b**): Agent $i$ sends the message ${\mu}_{i,j}$ to agent $j$.

**Figure 3.**Coordination Graph with 6 agents. Agent 3 is sending messages ${\mu}_{32}\left({a}_{2}\right)$, ${\mu}_{34}\left({a}_{4}\right)$ to its neighbors 2 and 4.

**Figure 4.**Example of spanning tree graph associated with CG from Figure 3.

**Figure 8.**Illustration of the segmentation process setting for Hybrid Factored Value Cost Max-Plus method [24] (reproduced with permission from P. Cotae, N. E. A. Reindorf, M. Kang, and A. Velazquez, “Work-in-Progress: A Hybrid Collaborative Multi Agent Decision Making Algorithm with Factored-Value Max-Plus”, 2023 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Istanbul, Turkiye, 2023, pp. 438–443, doi: 10.1109/BlackSeaCom58138.2023.10299698).

**Figure 9.** Cost Hybrid Factored Value MCTS Max-Plus Method [24] (reproduced with permission from [24], © 2023 IEEE).

**Figure 10.** Performance of Cost-Distributed Max-Plus Algorithm [24] (reproduced with permission from [24], © 2023 IEEE).

**Figure 11.** Performance of Cost Centralized Max-Plus Algorithm [24] (reproduced with permission from [24], © 2023 IEEE).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Alexander-Reindorf, N.-E.; Cotae, P.
Collaborative Cost Multi-Agent Decision-Making Algorithm with Factored-Value Monte Carlo Tree Search and Max-Plus. *Games* **2023**, *14*, 75.
https://doi.org/10.3390/g14060075
