alltom.com

Estimating player strength from team match outcomes

Or: I wanted to play with the SpawELO data, so I did.

I love probabilistic programming languages. You just write a program as if you were authoring a simulation of the universe itself—then let the computer strip out the regions of its parameter space that don’t align with your data.

This article is about simulating a little universe where every Dota 2 player’s skill can be summed up by a single numerical value.

It started with this article about SpawELO. Reading it reminded me of a probabilistic generative model of tug-of-war that sparked my interest every time I revisited Probabilistic Models of Cognition by Noah D. Goodman and Joshua B. Tenenbaum.

Like Elo, this model estimates a “strength” value for each player that tells you who is more likely to win against whom. And like Elo, you infer those strength values just by observing who wins (in tug-of-war). This is the textbook model, written in WebPPL:

Infer(function () {
  var strength = mem(function (person) {
    return Math.max(gaussian(1, 1), 0.01)
  })
  var lazy = function (person) { return flip(1/3) }
  var pulling = function (person) {
    return lazy(person) ? strength(person) / 2 : strength(person)
  }
  var totalPulling = function (team) {
    return sum(map(pulling, team))
  }
  var beats = function (team1, team2) {
    return totalPulling(team1) > totalPulling(team2)
  }

  condition(beats(['bob', 'mary'], ['tom', 'sue']))
  condition(beats(['bob', 'sue'],  ['tom', 'jim']))

  return {bobStrength: strength('bob')}
})

Because it’s a probabilistic generative model, instead of getting one “strength” value out, you actually get a chart of how likely every possible strength value is. That’s the main reason I use this type of model: I love understanding the distribution!

For example, this graph shows an estimate of bob’s strength:

article_1.png

It shows that the highest likelihood strength value is between 1 and 2, but values as low as 0 and as high as 4 can’t be entirely ruled out.

The shape tells a story, too. Without any information about a person, the model estimates strength on a bell curve centered at 1. But the chart above is based on the outcomes of the two games mentioned near the end of that model definition: that Bob and Mary won against Tom and Sue, and Bob and Sue won against Tom and Jim. Those wins provide enough information to start to say something more about Bob’s strength than “it’s on a bell curve”—so we get a sharper peak that’s slightly higher than 1.

Throwing the LAN party data in

You could just plug the data from the SpawELO Colab into the WebPPL program… but I have a love-hate relationship with WebPPL at the moment, so I used Mathematica instead. ¯\_(ツ)_/¯ Here are the “strength” values I inferred using a slightly modified model:

article_2.png

Here are the final SpawELO values, for comparison:

article_3.png

Using my computed “strength” values, the fairest 4v5 matchup would be:

Team 1 (strength: 5.3±0.4): Spawek, Bixkog, J, Goovie
Team 2 (strength: 5.3±0.5): Bania, Hypys, Muhah, Status, Vifon

But there are plenty of other good ones:

Spawek, Bixkog, J, Goovie (strength: 5.3±0.4) vs. Bania, Hypys, Muhah, Status, Vifon (strength: 5.3±0.5)
Spawek, Bixkog, J, Status (strength: 5.3±0.4) vs. Bania, Goovie, Hypys, Muhah, Vifon (strength: 5.4±0.5)
Spawek, Hypys, Bania, Vifon (strength: 5.4±0.5) vs. Bixkog, Goovie, J, Muhah, Status (strength: 5.3±0.5)
Spawek, Muhah, Bania, Goovie (strength: 5.3±0.5) vs. Bixkog, Hypys, J, Status, Vifon (strength: 5.4±0.5)
Spawek, Muhah, Bania, Status (strength: 5.3±0.5) vs. Bixkog, Goovie, Hypys, J, Vifon (strength: 5.4±0.5)
Spawek, Goovie, Vifon, Status (strength: 5.2±0.5) vs. Bania, Bixkog, Hypys, J, Muhah (strength: 5.4±0.5)
Bixkog, Muhah, Bania, Vifon (strength: 5.2±0.5) vs. Goovie, Hypys, J, Spawek, Status (strength: 5.4±0.5)

According to the values above, the teams in the original article don’t look nearly as fair:

article_4.png

So I’m curious how the games went. 😛

But I’ll spend the rest of this article digging into how I got my numbers.

How I modified the model for the LAN party

The idea behind WebPPL is really simple, and resembles the back-propagation-based ML approach that SpawELO used: write a function that simulates games of tug-of-war given everyone’s estimated strength, and let an optimizer find the strength values that best fit the data.

Here’s that function for the LAN party model. All it does is pick random strengths for everybody and simulate some games.

article_5.gif
article_6.gif
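The actual function is Wolfram Language, shown in the images above. Since that code is only visible as images, here is a rough Python sketch of the same idea. The player names come from the LAN party dataset; the rest mirrors the model: clamped Normal(1, 1) strengths, a 1/3 chance of laziness, and the higher total pull wins.

```python
import random

# Player names from the LAN party dataset.
PLAYERS = ["Spawek", "Bixkog", "J", "Goovie", "Bania",
           "Hypys", "Muhah", "Status", "Vifon"]

def sample_strengths():
    # One random strength per player, drawn from the clamped Normal(1, 1) prior.
    return {p: max(random.gauss(1, 1), 0.01) for p in PLAYERS}

def pulling(strength):
    # Each game, a player is lazy 1/3 of the time and pulls at half strength.
    return strength / 2 if random.random() < 1/3 else strength

def simulate_game(strengths, team1, team2):
    # True if team1 wins: higher total pull, laziness included.
    team1_total = sum(pulling(strengths[p]) for p in team1)
    team2_total = sum(pulling(strengths[p]) for p in team2)
    return team1_total > team2_total

strengths = sample_strengths()
result = simulate_game(strengths, ["Spawek", "Bixkog"], ["Bania", "Hypys"])
```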

The output looks like this:

article_7.gif

article_8.gif

But those results are completely random. They have no relation at all to actual matches. If we plot histograms of all the players’ strengths, they all have the same curve:

article_9.gif

article_10.png

That’s because, although it’s a model we can use for inference, all it does by itself is reflect our mental model of how players perform in games:

  • Every player has an inherent “strength” that never changes
  • Each game, every player has a 1/3 chance of being “lazy” and performing at only half strength
  • The winner of a game is whoever has the higher total strength after you take laziness into account

If you think that’s a reasonable approximation of how games and people work, then you’ll find the results satisfying. If you don’t, you can modify the model however you want. :) For example, it doesn’t account for sets of players who play especially well together—but it could! It doesn’t account for improvement over time—but it could! This type of model is very flexible.

Anyway, the trick of using this model to infer strength values is to filter the rows until they resemble real match outcomes. For example, if we filtered the table to just the rows where Alpinus won against Bania, we should see the estimate of Alpinus’s strength go up and Bania’s go down. And we do!

article_11.gif

article_12.png
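This filtering is rejection sampling: generate lots of samples, then keep only the ones whose simulated games agree with the observed outcomes. A self-contained Python sketch of the Alpinus-beats-Bania example (the helper functions restate the model; the match data is just that one game):

```python
import random

def sample_strengths(players):
    # The prior from the model: clamped Normal(1, 1).
    return {p: max(random.gauss(1, 1), 0.01) for p in players}

def pulling(s):
    # 1/3 chance of being lazy and pulling at half strength.
    return s / 2 if random.random() < 1/3 else s

def wins(strengths, team1, team2):
    return (sum(pulling(strengths[p]) for p in team1) >
            sum(pulling(strengths[p]) for p in team2))

players = ["Alpinus", "Bania"]
observed = [(["Alpinus"], ["Bania"])]  # (winner, loser)

kept = []
for _ in range(20000):
    s = sample_strengths(players)
    # Keep only the samples whose simulated games match the real outcomes.
    if all(wins(s, winner, loser) for winner, loser in observed):
        kept.append(s)

mean = lambda xs: sum(xs) / len(xs)
alpinus_mean = mean([s["Alpinus"] for s in kept])
bania_mean = mean([s["Bania"] for s in kept])
# alpinus_mean comes out above bania_mean, as expected
```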

We can just keep doing that to plug in all the real LAN party data, but there’s another interesting bit of data we could incorporate. Spawek recorded more than match outcomes: each match has a winner and a measure of how even the match was, encoded as the likelihood that the match went the way it did: win_probability is 0.75 for a win in an “even” match and 0.95 for a “one-sided” one. In other words, it measures how predictable the outcome was.

We can use the win_probability field to assign a weight to each of our generated samples. For example, if we simulate a J v Dragon match and J wins, and J also won the corresponding real match that was “even”, then we can weight that sample by 0.75. If, on the other hand, J lost the real match, we could multiply that sample’s weight by (1-0.75)=0.25.

This is a lot like the filtering we did before, but softer: instead of eliminating the samples that disagree with the actual results entirely, we just give less weight to the samples that don’t agree as well with what actually happened.

To do this, we’re going to modify the model:

article_13.gif
article_14.gif

article_15.gif
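A Python sketch of that weighted version, using the hypothetical J v Dragon match from above (the helper functions restate the model; the match list is illustrative):

```python
import random

def sample_strengths(players):
    # The prior from the model: clamped Normal(1, 1).
    return {p: max(random.gauss(1, 1), 0.01) for p in players}

def pulling(s):
    return s / 2 if random.random() < 1/3 else s

def wins(strengths, team1, team2):
    return (sum(pulling(strengths[p]) for p in team1) >
            sum(pulling(strengths[p]) for p in team2))

# One match in SpawELO's encoding: (winner, loser, win_probability).
# 0.75 marks an "even" match; 0.95 would be a one-sided one.
matches = [(["J"], ["Dragon"], 0.75)]
players = ["J", "Dragon"]

samples, weights = [], []
for _ in range(20000):
    s = sample_strengths(players)
    w = 1.0
    for winner, loser, p in matches:
        # Agreeing samples get weight p; disagreeing ones get (1 - p).
        w *= p if wins(s, winner, loser) else 1 - p
    samples.append(s)
    weights.append(w)

total = sum(weights)
mean_j = sum(w * s["J"] for s, w in zip(samples, weights)) / total
mean_dragon = sum(w * s["Dragon"] for s, w in zip(samples, weights)) / total
```

The 0.75/0.25 split tilts the posterior toward samples where J is the stronger player, but less sharply than hard filtering would.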

Calculating actual results

Now we’re ready.

I downloaded the data from the SpawELO Colab:

article_16.gif

article_17.gif

I extracted all the player names:

article_18.gif

article_19.gif
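I don’t show the exact shape of the Colab data here, but assuming each match is a (winning team, losing team, win_probability) record, extracting the unique names is one set comprehension in Python (the two records below are made up for illustration):

```python
# Hypothetical match records: (winning team, losing team, win_probability).
matches = [
    (["Spawek", "Bixkog"], ["Bania", "Hypys"], 0.75),
    (["J", "Goovie"], ["Muhah", "Status"], 0.95),
]

# Collect every name that appears on either side of any match.
players = sorted({name for winners, losers, _ in matches
                  for name in winners + losers})
# → ['Bania', 'Bixkog', 'Goovie', 'Hypys', 'J', 'Muhah', 'Spawek', 'Status']
```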

Then I made a version of the model that assigns sample weights according to the combined likelihood of every simulated win and loss:

article_20.gif
article_21.gif

article_22.gif

The weights are very small because there are so many games. Each weight is the product of 35 likelihoods. The weight we’d assign to a sample predicting all 35 one-sided games correctly is already small:

article_23.gif

article_24.gif

But the weight we’d assign a sample that mispredicted just ten of them is minuscule:

article_25.gif

article_26.gif
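In Python terms (the log-space line at the end is the usual trick for keeping products like these well away from underflow):

```python
import math

p_right, p_wrong = 0.95, 1 - 0.95

# A sample that predicts all 35 one-sided games correctly:
all_right = p_right ** 35                     # ≈ 0.166

# A sample that gets just ten of them wrong:
mostly_right = p_right ** 25 * p_wrong ** 10  # ≈ 2.7e-14

# Summing log-likelihoods instead of multiplying likelihoods
# gives the same weight without the tiny intermediate products.
log_w = 25 * math.log(p_right) + 10 * math.log(p_wrong)
assert math.isclose(math.exp(log_w), mostly_right)
```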

But that’s part of why I’m using Mathematica: I know what knobs to turn if floating point precision turns out to be a problem. :)

Anyway, let’s run the model with every match in the dataset!

article_27.gif

article_28.png

To make the charts at the start of this article, I summarized every distribution as a mean + confidence interval, like:

article_29.gif

article_30.gif
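The exact summary code is in the images above, but a weighted mean ± standard deviation sketch in Python, with made-up samples and weights, looks like this:

```python
import math

# Hypothetical posterior samples of one player's strength, with their weights.
samples = [0.8, 1.1, 1.3, 1.0, 1.6, 0.9, 1.2]
weights = [0.2, 1.0, 0.7, 0.9, 0.3, 0.6, 0.8]

total = sum(weights)
# Weighted mean: each sample counts in proportion to its weight.
mean = sum(w * x for x, w in zip(samples, weights)) / total
# Weighted variance around that mean, then its square root.
var = sum(w * (x - mean) ** 2 for x, w in zip(samples, weights)) / total
std = math.sqrt(var)

summary = f"{mean:.1f}±{std:.1f}"
```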

Take-aways

I love using models like these. I love that I can just write my mental model for how a process works and start making inferences with it. I love that I get full joint probability distributions for any expression in the model.

I don’t love how incredibly expensive these models are to run. Generating the charts at the start of this article took about half an hour on my MacBook Air M1. I could do a lot better than uncompiled Wolfram Language, but no tech choice would let me approach the performance of a model that only needs to minimize a loss and output a single strength estimate per player instead of the full distribution. But to me, understanding the uncertainty of the answer is worth it.

If you’re interested, I highly recommend reading the entire ProbMods book. It has embedded, executable, editable code samples every few paragraphs. It’s enjoyable enough to read that I think I’ve referred to it more than any other textbook(!).

Something I didn’t get to: strength distribution

I used a different distribution of strength than the original ProbMods code. When they randomly pick each player’s strength, they literally clamp the value sampled from a normal distribution, which (IMO) creates an unnatural peak at zero:

article_31.gif

article_32.png

I think that was an accident, so in my models, I truncated it with renormalization instead:

article_33.gif

article_34.png

This shape says that the most likely strength is 1, and there is a skill floor at 0.01, but no skill ceiling for the rare savants.
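The difference is easy to see by sampling. In this Python sketch, clamping piles roughly 16% of the probability mass (everything Normal(1, 1) puts below 0.01) exactly onto the floor, while truncation with renormalization spreads that mass back over the rest of the curve:

```python
import random

FLOOR = 0.01

def clamped():
    # Clamping: all mass below the floor lands exactly at 0.01.
    return max(random.gauss(1, 1), FLOOR)

def truncated():
    # Truncation with renormalization: resample until we clear the floor,
    # which redistributes the below-floor mass over the remaining curve.
    while True:
        x = random.gauss(1, 1)
        if x >= FLOOR:
            return x

n = 50000
clamp_at_floor = sum(clamped() == FLOOR for _ in range(n)) / n
trunc_at_floor = sum(truncated() == FLOOR for _ in range(n)) / n
# clamp_at_floor ≈ 0.16 (the spike at the floor); trunc_at_floor ≈ 0
```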

Something I didn’t get to: efficient inference

The strength values and charts at the start of this article were actually generated in a slightly different way than I described in the rest of the article. The model is basically the same, but I used the Metropolis-Hastings algorithm to pick the random values instead of generating each sample independently.

In short, if you actually pick completely random strength values every time, then you’ll waste a lot of time generating low-likelihood samples. The MH algorithm improves the situation slightly: it allows you to tweak strength values instead of generating them from scratch, so when a valid combination is found, you can explore around there instead of jumping to a whole new random point in the distribution. That would normally skew the results by only exploring high-likelihood regions, but the MH algorithm adds a sample selection process to ensure that you get accurate distributions out.
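Here is a minimal, self-contained Python sketch of that idea on a single strength value, targeting just the truncated Normal(1, 1) prior rather than the full model:

```python
import math
import random

def density(x):
    # Unnormalized target: Normal(1, 1) truncated below at 0.01.
    return math.exp(-0.5 * (x - 1) ** 2) if x >= 0.01 else 0.0

x = 1.0
samples = []
for _ in range(50000):
    # Propose a small tweak to the current value instead of a fresh draw.
    proposal = x + random.gauss(0, 0.5)
    # Accept with probability min(1, density ratio); this correction is
    # what keeps the long-run samples distributed like `density`.
    if random.random() < density(proposal) / density(x):
        x = proposal
    samples.append(x)

mean = sum(samples) / len(samples)  # ≈ 1.29 for this truncated normal
```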

I left all this out of the article above because I’m still learning to use MH well, and there are better, more modern algorithms anyway (I think No-U-Turn sampling is state-of-the-art?). The results are basically the same because both algorithms converge to the same answers—MH just converges more quickly.