Deep Reinforcement Learning from Self-Play in No-limit Texas Hold'em Poker

  • T.-V. Pricope, The University of Edinburgh, School of Informatics, 10 Crichton St, Newington, Edinburgh EH8 9AB, United Kingdom

Abstract

Imperfect-information games describe many practical real-world applications, since the full information space is rarely available. This class of problems is challenging because of the stochasticity involved, which can cause even adaptive methods to model the problem incorrectly and miss the best solution. Neural Fictitious Self-Play (NFSP) is a powerful algorithm for learning an approximate Nash equilibrium of imperfect-information games from self-play. However, it uses only raw data as input, and its most successful experiment to date was on the limit version of Texas Hold'em Poker. In this paper, we develop a new variant of NFSP that combines the established fictitious self-play with neural gradient play, aiming to improve performance on large-scale zero-sum imperfect-information games and to tackle the more complex no-limit version of Texas Hold'em Poker using powerful handcrafted metrics and heuristics alongside raw data. When applied to no-limit Hold'em Poker, the agents trained through self-play outperformed those that used fictitious play with a normal-form, single-step approach to the game. Moreover, we show that our algorithm converges close to a Nash equilibrium within a limited training budget on very modest hardware. Finally, our best self-play-based agent learnt a strategy that rivals expert human level.
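
For readers unfamiliar with the mechanism the paper builds on, NFSP trains two learners per agent: a reinforcement-learning best response and a supervised average policy, mixed by an anticipatory parameter. The sketch below illustrates that core loop only; it is not the authors' implementation. It assumes simple NumPy linear approximators in place of deep networks, an illustrative NFSPAgent class with hypothetical parameter names, and omits the environment interface, reservoir sampling, and replay-buffer management.

    # Minimal sketch of the NFSP agent loop (after Heinrich & Silver, 2016).
    # Linear approximators stand in for the deep networks; names are illustrative.
    import random
    import numpy as np

    class NFSPAgent:
        def __init__(self, n_features, n_actions, eta=0.1, eps=0.06, lr=0.01):
            self.eta = eta              # anticipatory parameter: P(play best response)
            self.eps = eps              # epsilon for the greedy best-response policy
            self.lr = lr
            self.n_actions = n_actions
            self.Q = np.zeros((n_actions, n_features))   # best-response value function
            self.Pi = np.zeros((n_actions, n_features))  # average-policy logits
            self.sl_buffer = []         # supervised memory of (state, best-response action)

        def act(self, state):
            # With probability eta act greedily w.r.t. Q and record the action for
            # supervised learning; otherwise sample from the average policy Pi.
            if random.random() < self.eta:
                if random.random() < self.eps:
                    action = random.randrange(self.n_actions)
                else:
                    action = int(np.argmax(self.Q @ state))
                self.sl_buffer.append((state, action))   # reservoir sampling omitted
            else:
                logits = self.Pi @ state
                probs = np.exp(logits - logits.max())
                probs /= probs.sum()
                action = int(np.random.choice(self.n_actions, p=probs))
            return action

        def update(self, state, action, reward, next_state, done, gamma=1.0):
            # One Q-learning step on the observed transition (replay sampling omitted).
            target = reward if done else reward + gamma * np.max(self.Q @ next_state)
            td_error = target - self.Q[action] @ state
            self.Q[action] += self.lr * td_error * state
            # One supervised step: push the average policy towards past best responses.
            if self.sl_buffer:
                s, a = random.choice(self.sl_buffer)
                logits = self.Pi @ s
                probs = np.exp(logits - logits.max()); probs /= probs.sum()
                grad = -probs; grad[a] += 1.0            # gradient of the log-likelihood
                self.Pi += self.lr * np.outer(grad, s)

The paper's variant departs from this baseline by mixing fictitious self-play with neural gradient play and by feeding handcrafted poker metrics alongside the raw inputs.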



Published
2021-12-15
How to Cite
PRICOPE, T.-V. Deep Reinforcement Learning from Self-Play in No-limit Texas Hold'em Poker. Studia Universitatis Babeș-Bolyai Informatica, [S.l.], v. 66, n. 2, p. 51-68, Dec. 2021. ISSN 2065-9601. Available at: <https://www.cs.ubbcluj.ro/~studia-i/journal/journal/article/view/72>. doi: https://doi.org/10.24193/subbi.2021.2.04.