AlphaGo [100 to] zero
While in Warwick last week, I read a few times through the Nature article on AlphaGo Zero, the new DeepMind program that learned to play Go by itself, through self-play, within a few clock days, and achieved massive superiority (100 to 0) over the earlier version of the program, which (who?!) was based on a massive data-base of human games. (A Nature paper I also read while in Warwick!) From my remote perspective, the neural network associated with AlphaGo Zero seems more straightforward than the double network of the earlier version. It is solely based on the board state and returns a probability vector p over all possible moves, as well as the probability v of winning from the current position. There are still intermediary probabilities π produced by a Monte Carlo tree search, which drive the selection of moves during self-play, the (reinforced) learning aiming at bringing p and π as close as possible, via a loss function like
(z − v)² − ⟨π, log p⟩ + c‖θ‖²
where z is the final game outcome (±1 for the eventual winner) and θ is the vector of parameters of the neural network. (Details obviously missing above!) The achievements of this new version are even more impressive than those of the earlier one (which managed to systematically beat top Go players) in that blind exploration of game moves, repeated over some five million self-play games, produced a much better AI player, with a strategy at times remaining a mystery to Go players.
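For what it is worth, here is a minimal numerical sketch of that loss for a single position, assuming z = ±1 for the outcome, v the predicted value, π the search probabilities, p the network policy, and c a small L2 weight; the function name, toy values, and the c = 1e-4 default are mine, not lifted from the paper:

```python
import numpy as np

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """Sketch of the per-position loss (z - v)^2 - <pi, log p> + c * ||theta||^2."""
    value_loss = (z - v) ** 2              # squared error on the game outcome
    policy_loss = -np.dot(pi, np.log(p))   # cross-entropy between search and network policies
    reg = c * np.sum(theta ** 2)           # L2 penalty on the network parameters
    return value_loss + policy_loss + reg

# toy example with three legal moves
pi = np.array([0.7, 0.2, 0.1])     # MCTS visit-count probabilities
p = np.array([0.5, 0.3, 0.2])      # network policy output
theta = np.random.randn(10)        # stand-in for the parameter vector
print(alphago_zero_loss(z=1.0, v=0.8, pi=pi, p=p, theta=theta))
```

The point of the penalty −⟨π, log p⟩ is simply that it is minimised (for fixed π) when p matches the tree-search probabilities, which is how the self-play feedback reinforces the network.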
Incidentally, a two-page paper appeared on arXiv today with the title Demystifying AlphaGo Zero, by Dong, Wu, and Zhou. Which sets AlphaGo Zero as a special generative adversarial network. And invokes the Wasserstein distance to settle the convergence of the network. To conclude that “it’s not [sic] surprising that AlphaGo Zero show [sic] a good convergence property”… A most perplexing inclusion in arXiv, I would say.