Elo scoring two years of Magic: The Gathering games (dylanlott.com)
40 points by shakezula on April 19, 2022 | 51 comments


I think unfortunately Elo is misapplied here.

Elo is appropriate for chess, where there is no initial game-state variance, and no built-in advantage for either competitor except who goes first; that can be addressed by averaging, by using the results of tournaments where the competitors swap colors, or simply by maintaining a separate Elo as white and as black.

Similarly for Starcraft you can track Elo separately for Terran/Zerg/Protoss. (Technically you would also need to do the same by map, but anyway...)

With MTG, you have a huge effect from the quality of the deck. Unless you have each player play with each deck, there's no way to disentangle the quality of the deck from the quality of the player. And if you did have that data, Elo couldn't leverage it -- you'd need a more sophisticated model to account for that statistical effect.

Then there's the game-state variance you allude to... Regardless of how good you are at MTG, and even how good your deck is, you're going to lose a lot of games due to mana flood / mana screw / etc. When that happens to either player, the outcome of the game does not contain useful information about skill. Of course if you sample enough games, you can still figure out what is skill and what is chance, but using Elo with low-count datasets is bound to be misleading because it is designed for games of pure skill, where game outcomes contain information about relative skill levels 100% of the time. Maybe you could establish some rules about what games are appropriate to use as indicators of relative skill, and which ones must be discarded?

Anyway it's an interesting idea. Here's related reading for the MMR score used in Magic Arena:

https://hareeb.com/2021/05/23/inside-the-mtg-arena-rating-sy...


> With MTG, you have a huge effect from the quality of the deck. Unless you have each player play with each deck, there's no way to disentangle the quality of the deck from the quality of the player. And if you did have that data, Elo couldn't leverage it -- you'd need a more sophisticated model to account for that statistical effect.

How well a player chooses their deck is one of the factors that determines how good a player is. You can say the same thing about the other games: I'd probably have a better rating in chess if I didn't only play somewhat unsound gambits, and I'd definitely have a better rating in Starcraft if I didn't only do 2port wraith in TvZ.


It is very rock-paper-scissors even at the top level.


I'm not sure if it's the same in Magic, but when I played Yu-Gi-Oh, how well-made a deck was was mostly just an indicator of how much money you had.


I'm not so sure.

A deck is something you have. A build order, or a chess opening, is something you _know_ and therefore more or less what I'd be comfortable calling skill.


In my experience playing MTG (and other card games), when players discuss skill, they generally mean a (somewhat fuzzy) combination of both deckbuilding and "piloting" ability. It's understandable to want to draw a line between the two and say "I only want to evaluate in-game decision-making," but that's wildly impractical (it's going to be really hard to develop a model which fairly accounts for the fact that your buddy Jeff only likes playing decks which do nothing for 50 turns and then win the game on the spot iff no one else has a counterspell[0]). Moreover, part of the way card game leagues work (again, in my experience) is that players spend a lot of time trying to figure out how to make their decks better and adapt them to what other players are doing. If you can't capture that effort, I honestly think you might be missing the point a little bit.

[0] Let's be clear, Jeff's deck is bad and he's going to lose a lot, even if he's a time-traveling supercomputer with the diplomatic finesse of Otto von Bismarck.


> when players discuss skill, they generally mean a (somewhat fuzzy) combination of both deckbuilding and "piloting" ability.

I think the "piloting" ability is mostly (but not entirely) independent of the deck. You can see this most plainly in draft, where everyone is basically playing with a new deck. There are "soft" skills that are contextual and format-dependent, like knowing the cards that you need to play around (white just foretold turn 2, is that a Doomskar? Maybe I shouldn't play a creature this turn, etc). There are "hard" skills that are almost always valid (generally wait until after combat to spend mana, cast instants during your opponent's end step, etc).

But certainly deckbuilding talent is not necessary, because anybody can grab a decklist and head to TCGPlayer.

On that note, I'd guess that draft (or sealed) tournaments are the best scenario to measure pure skill using Elo alone, since going into a tournament, everyone has an equal chance to open good cards.


Beyond a certain level, everyone has access to all the cards they want. That might cause a poor fit at the bottom levels, but at intermediate to advanced levels, it's not about ownership.


>everyone has access to all the cards they want

Top-tier Commander decks can easily cost over $10,000; I suspect the vast majority of players do not have access to them.


That’s a fair point for Commander. But the GP seemed to be making a universal claim that Elo wouldn’t make sense in any case, even, say, Standard, where decks are <$500 (or Arena, where you can get a meta deck for <$100, free if you have time to grind wildcards).

I think perhaps we could agree on an intermediate claim that Elo could work amongst the pool of players that can build any deck they want, which is a big pool for MTGA, and definitely a smaller pool for physical Standard, and probably very small for Commander.

I wonder if you could decompose the score by playing some games with a fixed deck too? Eg Arena challenges. How much of your overall Elo is just picking the right deck for the local meta, vs raw quality of plays?


That depends fully on the format. The original article doesn't specify, afaik, but the only sanctioned Commander games I've ever played in a tournament format were limited events.


MtG is like car racing. Sure, most people own cars (hi from California) and can theoretically race them. But anyone any good at racing has a lot of money.


I actually came up with a play style variation that avoids the mana flood / screw and my son and I use it when we play. Honestly, I find it a lot more fun.

You split your deck into two stacks. One with land and one with everything else. For your starting hand, you take 3 land and 4 of everything else.

Each draw phase, you pick which stack you draw from.

That’s it. Everything else stays the same but mana floods / screws completely stop.
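
For a rough sense of how much variance this removes, here's a quick Monte Carlo sketch (Python; the 60-card/24-land deck and the 3-6 "playable lands" band are my own assumptions, not part of the parent's rules):

    import random

    DECK_SIZE, LANDS = 60, 24

    def lands_seen(cards_seen):
        # Lands in an opening hand plus draws from a normally shuffled deck.
        deck = [1] * LANDS + [0] * (DECK_SIZE - LANDS)
        random.shuffle(deck)
        return sum(deck[:cards_seen])

    trials = 100_000
    # Opening hand (7) plus draws through turn 5 = 12 cards seen.
    bad = sum(1 for _ in range(trials) if not 3 <= lands_seen(12) <= 6)
    print(f"{bad / trials:.1%} of shuffled games fall outside 3-6 lands by turn 5")

Under the two-pile rule that number goes to zero by construction, since every land draw is chosen.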


It’s good for teaching younger players who still have temper problems, but there’s only so much of the game you can experience this way. And don’t expect to get to advanced or expert strategies without the game balance falling apart.

One knock-on effect I’d predict is that higher mana value cards would be substantially more playable. I expect a deck of walls + counterspells + removal + big finishers like the Eldrazi or even just Baneslayer Angel to be much more effective than it is now.

On the other end of the spectrum super low to the ground aggro strategies also get a huge bonus by simply never having to draw a land again.

Probably Storm (play a bunch of cheap spells, typically with a discount or with effects that give you mana when you cast a spell) gets a huge boost as well as they can ensure they never fizzle out. Once the engine is going they’ll always win unless they get countered.

The decks that lose out here are the ones in the middle: the midrange, “fair” decks that are just trying to curve out with the best play each turn.

And all that’s not counting the rules headaches with cards like Oracle of Mul Daya, Fact or Fiction, Treasure Hunt, or Dark Confidant. Which pile does my Maze of Ith go in? Cultivate? Sol Ring? Faceless Haven?

With that said it’s also my personal opinion that variance just makes the game more enjoyable and widens the group of players you can compete against, as long as you have the emotional capacity to not take losses personally.


Variance is a critical lifeblood for card games, as the companion mechanic debacle demonstrated; it's a key lesson that MTG R&D has known but occasionally forgets. But at many levels, not only casual ones, MTG can definitely have too much starting-hand quality variance, to the point where it starts to reduce fun. Tournament mulligan rules and MTG Arena's starting-hand sampling algorithm point to this.

I personally suspect that aggro would completely dominate a separate land deck meta if pushed hard enough. But I'm all on board for an alternative game design that invents new interesting questions to ponder, addresses a pain point of mtg design, and most of all makes it more fun for a kid.


> I personally suspect that aggro would completely dominate a separate land deck meta if pushed hard enough.

Agreed, I probably should have listed it first.

Even in the short term, look at when Arena ran their Treasure events, where each upkeep the player would get a Treasure token. By the end of the first day the event was dominated by “mono red” decks with 13 lands and free splashes.


Fwiw, the cards we have to play with are pretty limited to a few starter sets, so very specialized decks don't really happen. I got rid of all of my cards from the 90s (still kicking myself there).

It's more that it keeps the game fun as he's just getting into it. You're guaranteed that both players are going to have playable draws.

For any setting with more advanced players there would definitely be side effects and a more polished set of rules in place for those special cards and circumstances.


It is actually the opposite. If you choose your pile then low cost decks are better because you can choose to spend fewer draws on land. Expensive cards are better if you get a draw from each pile each turn.


I did not order the list of effects by importance or impact, I ordered them by what came to mind while I was typing.

I touched on low to the ground aggro decks in the very next paragraph, and I agree that's probably the biggest issue.

However aggro decks becoming more powerful does not mean control decks cannot also become more powerful. Decks aren't a one dimensional plot of aggro to control with midrange in the middle. (though to be fair I did say "on the other end of the spectrum" in my initial post)

When I say higher mana value cards are more playable, I mean in the sense that they can be cast "on curve" much more often, because if you want to cast a 7-drop on turn 7 without ramp you can choose to do so, while with a normal deck you might expect to hit 7 mana between turns 8 and 11. (Not a hard calc, just a gut check.)

If you have to answer your opponents' threats 1-1 early you can refill with answers, and if you get ahead with 2-1s or 3-1s you can keep drawing land to play your expensive threat on curve.


The variance argument may be solid in general but I will say that mana flood and mana screw can be greatly alleviated through deck building and use of mulligan.

You don't often see it happen during high level play.

I used to be rather careless in how I planned the mana of my decks and rarely took a mulligan. I faced mana issues all the time. After putting more planning into my mana base and deciding on a careful strategy for when to take a mulligan, I now rarely experience those issues. When I do, it's mainly because I break my own rules out of greed, refusing to admit that a hand with great cards is too light on mana.


I agree that Elo falls rather short for multiplayer games (the article's approach probably converges much more slowly than an approach built around supporting multiplayer contests, or fails to converge, and the simplification for "board zaps" is likely just plain wrong -- although that might be a limitation of how they recorded their games). But I don't think individual MTG games having a substantial amount of luck should really impact the usefulness of Elo (or similar systems such as Glicko). After all, Elo is just trying to find ratings which best predict a given game outcome, so the presence of good/bad draws should still be well-modeled by that idea. In particular, for two given players (at a particular point in time and holding particular decks[0]), it stands to reason that you should still be able to find some pair of ratings Rx and Ry s.t. P(x beats y) = 1/(1 + 10^((Ry - Rx)/400)).

That being said, the inherent randomness of MTG maybe means that, in an ill-defined, abstract sense, it takes "more skill" to improve 100 Elo points in MTG than in chess, because X% of your games have no meaningful decisions, so you have fewer places to take advantage of your superior decision-making. This probably also has real implications for reasonable choices of K if you're running, say, MTG Arena. But the article is pretty clear that they're not doing anything especially rigorous when picking K in the first place, and honestly (IMO) it probably doesn't matter a whole lot if you're running a Friday night beer league with some friends or whatever.

[0] I agree with the sibling comment that deck selection and deckbuilding are a large part of what Magic players mean when they discuss skill, and it seems very reasonable to allow those things to be included in our model.
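
For anyone following along, a minimal sketch of the prediction plus the K-factor update (Python; the flat K here is an assumption, and the article's exact constants may differ):

    def expected_score(r_x, r_y):
        # P(x beats y) under the standard Elo logistic curve.
        return 1 / (1 + 10 ** ((r_y - r_x) / 400))

    def update(r_x, r_y, x_won, k=32):
        # Nudge both ratings toward the observed result; an upset
        # moves more points than an expected win does.
        e = expected_score(r_x, r_y)
        s = 1.0 if x_won else 0.0
        return r_x + k * (s - e), r_y - k * (s - e)

    # A mana-screwed favorite losing one game barely dents the estimate:
    print(update(1700, 1500, x_won=False))  # ~(1675.7, 1524.3)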


Elo isn't misapplied here. It's just that when game results have a higher luck factor, you get a narrower distribution with shorter tails. You don't get those 2800 Elo players like in chess, who have virtually a 100% chance of beating nearly everyone every time. The best and worst players tend more to the center, but there's still meaningfulness behind the score.


> Regardless of how good you are at MTG, and even how good your deck is, you're going to lose a lot of games due to mana flood / mana screw / etc.

What makes this different from blind build order choices in Starcraft? The greed > safe > rush > greed interactions often set one player ahead pretty arbitrarily in the very early game.


It's more like if mineral placement was randomized at the start of a game, with players having uneven access.

What you are describing with the build order exists too. Many Magic decks have a single game plan ("rush", etc.) and can only minimally adapt between games in a match (by swapping cards from a 15-card sideboard). How uneven a matchup is can vary a lot, and some decks are hybridized, so it doesn't just devolve into rock/paper/scissors.


Given that the purpose of Elo's system is to predict the outcome of a game between two players who have no or limited prior interaction, it can be "misapplied" to great effect. While unfairness and randomness (starting as Black vs. White in chess) can bias that estimate and increase its variance, it is still better than tossing a coin.


So Elo scores were actually used to track players until 5-10 years ago. And I think you can still see your score in your profile details in Magic Online.

Back then they were trying to identify the best magic players in the world. I think it started at 1600.

In the mid 90s it was hard to get locally sanctioned games, so I manually tracked games in my college town and used it as my first blog ever, posting the scores of local players for a year or so. I wish the site was archived, but I never kept a copy when I went to work and abandoned the site. I remember going through the formula to calculate the score; I couldn’t get it to work in Excel or JavaScript, so I manually calculated it out on paper for a few dozen games a week. I think back about how much time I could have saved if I just had a little more programming skill.


I track ratings and player records for a local tabletop board game league, and the question of how to choose and implement a rating system ends up being pretty interesting, with a lot of literature to read if you start following citations.

Even if you have well-defined, sequential 2-player matches, where a widely-used model[0] exists in the literature, there are a wide variety of ways to estimate player ratings from game results, which all have their own assumptions and various tuning parameters.[1] If your domain also includes team-based or multiplayer matches (or some other weird feature that you want to account for), you then get to decide whether you want to try and hack together something using Elo because it'll be "close enough"[5], or whether you want to try and use (or build!) a more sophisticated system which captures those nuances, such as Microsoft's TrueSkill(TM) system[6].

[0] The so-called "Bradley-Terry model" https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model

[1] Beyond Elo, which is described in the article, options range from Glicko/Glicko-2[2], which still do incremental updates to participants after each match but track rating uncertainty/player growth in a more sophisticated way, to systems like Edo[3] and Whole-History Rating[4], which attempt to find the maximum-likelihood ratings for all players at all points in time simultaneously.

[2] https://en.wikipedia.org/wiki/Glicko_rating_system

[3] http://www.edochess.ca/

[4] https://www.remi-coulom.fr/WHR/WHR.pdf

[5] This is (obviously) the approach taken in the article, and IMO is probably the right answer unless you're a huge nerd who's interested in wasting a ton of time for not-much practical benefit.

[6] https://en.wikipedia.org/wiki/TrueSkill
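
To make the batch-fitting idea concrete, here's a tiny sketch of a Bradley-Terry fit using the classic minorization-maximization update (Python; the players and records are invented, and the 400*log10 rescaling at the end is just to put strengths on an Elo-like scale):

    import math
    from collections import defaultdict

    # Hypothetical head-to-head results: (winner, loser) per game.
    games = [("ann", "bob"), ("ann", "bob"), ("bob", "cid"),
             ("ann", "cid"), ("cid", "ann"), ("bob", "ann")]

    players = sorted({p for g in games for p in g})
    wins = defaultdict(int)
    pair_count = defaultdict(int)
    for w, l in games:
        wins[w] += 1
        pair_count[frozenset((w, l))] += 1

    strength = {p: 1.0 for p in players}
    # MM iterations (Zermelo); needs a connected comparison graph and
    # at least one win and one loss per player to behave.
    for _ in range(100):
        new = {}
        for i in players:
            denom = sum(pair_count[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in players if j != i)
            new[i] = wins[i] / denom
        scale = len(players) / sum(new.values())  # only ratios matter
        strength = {p: s * scale for p, s in new.items()}

    for p in players:
        print(p, round(400 * math.log10(strength[p]), 1))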


If you're open to trying it out, I wrote an open-license alternative to TrueSkill: https://github.com/philihp/openskill.js, which has been ported to half a dozen languages. I'd love your feedback on it.


This looks sweet, I'll definitely play around with it a bit.


Interesting take on the multiplayer Elo scoring. Another approach I've seen is to essentially treat an n-player game as an (n choose 2) set of two-player games. So, in the case where A>B>C>D, A wins three games (vs B, C, and D), B wins two games (plus her loss against A), C wins one game (plus losses against A and B), and D loses three games. The advantage of this over the model described in the article is that it more gracefully handles cases where there are what the author called multi-player zaps. In that instance, where say D is eliminated first and B and C simultaneously, A still beats B, C, and D, but B and C are treated as having tied¹ and both beat D.

1. A tie is not a strictly neutral event in many Elo scoring systems: usually it means that the higher-ranked player loses some Elo while the lower-ranked player gains some, just not as much as in a straight victory.

For team-based play (like with Spades), Board Game Arena treats the partners as having tied, which is, I think, incorrect. A better approach is probably to treat it as a match between two players where each team's Elo is the mean of its members' individual ratings. The tie approach means that a strong player is penalized for having a weaker partner.
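
Something like this sketch of the pairwise decomposition (Python; the K value and the tie-as-half-win convention are my assumptions):

    from itertools import combinations

    def expected(r_a, r_b):
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    def update_multiplayer(ratings, finish, k=16):
        # finish maps player -> place (1 = winner); equal places tie.
        delta = {p: 0.0 for p in ratings}
        for a, b in combinations(ratings, 2):
            if finish[a] == finish[b]:
                s = 0.5          # simultaneous elimination counts as a tie
            else:
                s = 1.0 if finish[a] < finish[b] else 0.0
            d = k * (s - expected(ratings[a], ratings[b]))
            delta[a] += d
            delta[b] -= d
        return {p: r + delta[p] for p, r in ratings.items()}

    # A wins, B and C are zapped together, D was eliminated first:
    start = {"A": 1500, "B": 1500, "C": 1500, "D": 1500}
    print(update_multiplayer(start, {"A": 1, "B": 2, "C": 2, "D": 4}))
    # -> A gains, B and C stay put, D loses the mirror amount.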


That's an interesting approach, treating them as (n choose 2) sets. I might have to try modeling that out and applying it to the same data to see how it changes the numbers.

I have a number of table zaps recorded in subjective data, and I've considered a similar approach: treating a game recorded as a table zap as A winning a game against B, C, and D. However, you're right, that doesn't handle the case of D being out and then B and C being zapped at the same time. I think that's a subtle but important distinction.

Re: two-player matches -- that's a much better approach than my naive interpretation, definitely going to implement that sometime.


Is MtG one of the few games where being able to spend $10,000 USD puts you at a significant advantage (if playing Vintage)? I think Vintage is my favorite format, but it is becoming more expensive than high-end watch collecting.


I'd say the number of games like that is not small. $10k in equipment in most sports will give you an advantage over $1k in equipment.

It's more accurate to say that the entry cost into the Vintage format is 10k (or whatever the cost of the deck you want to play). You can't just throw money at the deck and increase its winrate unless your deck starts out suboptimal.


Almost nobody plays vintage. Legacy is largely dead. The most expensive competitive format is modern, where competitive decks can run in the low thousands.

1000 or 2000 in equipment is absolutely normal for a lot of competitive endeavors.


Jeff Lynne is getting into some esoteric things to make music for.


I too am old.


At one time I wanted to try and independently measure decks and players. I modeled much of the game and created a rudimentary AI to play the decks. My goal was to be able to compare decks to tell you which one was "better". As I went, I thought of cool ways to compare play strategies too. It was a really fun project, but in the end I succumbed - the model was getting more and more sophisticated but it was still far from complete. It's in my graveyard of cool projects that I got to 50-70% complete.


> My goal was to be able to compare decks to tell you which one was "better".

This would be a really out-of-the-box way to compensate for the deck quality bias I allude to in my other post -- normalize the effect of the deck on game outcomes by using a static "deck quality" score.

I suspect that coming up with halfway decent "deck quality scores" is an extremely difficult problem, though. It's not much of a leap from there to imagine using a computer to solve for the best possible deck in the format, the implications of which are terrifying for competitive Magic (and would be priceless to card speculators)


I think it's not possible to normalize a single 'deck quality' score, because the effectiveness of a deck depends on its opponents: you can have a deck that's good against some decks and weak against others in an intransitive manner, so deck quality is conditional on the frequency of other decks in the 'competitor pool', i.e. the metagame. Game theory says that if there is no single dominant deck (and I think there would not be in MtG), there should be a Nash equilibrium of mixed strategies, e.g. I pull out deck A with x% probability and deck B with y% probability. With MtG rules that likely involves a distribution of many decks with different counter-strategies and counter-counter-strategies.
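
A toy illustration of that conditionality (Python; the win-rate matrix is invented, and "best deck" flips with the field):

    # Made-up pairwise win rates: W[i][j] = P(row deck beats col deck).
    decks = ["aggro", "midrange", "control"]
    W = [[0.50, 0.60, 0.40],
         [0.40, 0.50, 0.60],
         [0.60, 0.40, 0.50]]

    def field_winrate(i, shares):
        # Expected win rate of deck i against a metagame with these shares.
        return sum(W[i][j] * shares[j] for j in range(len(decks)))

    for shares in ([0.6, 0.2, 0.2], [0.2, 0.2, 0.6]):
        best = max(range(len(decks)), key=lambda i: field_winrate(i, shares))
        print(shares, "->", decks[best], round(field_winrate(best, shares), 3))
    # An aggro-heavy field favors control; a control-heavy field favors midrange.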


Deck quality scoring is a huge problem, but you're absolutely right: it quickly bubbles out into exponential problem spaces. For example, even among a given Commander deck selected for a given matchup of `x` players, that decklist could have changed in each of the last `n` games.

For this reason, the "assume a sphere with no friction" joke here is that deck selection, lock-in / mulligan processes, information asymmetry, and turn order are all assumed to be equal and at that player's local maximum.


I was starting out with baby steps related to how well balanced your mana was. I would calculate the likelihood of a particular permanent being cast on turn #0-n. Never got to the point of creating a single index to score a deck overall. I had a long way to go. But I imagined taking some clever machine learning algorithms to help find suggested cards and swapping those in to create suggested decks.

And I imagined this all as a service people would pay for, lol.
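
The on-curve probability part of this is a clean hypergeometric calculation, for what it's worth (a sketch; the 60-card/24-land deck is a placeholder):

    from math import comb  # Python 3.8+

    def p_at_least_k_lands(deck=60, lands=24, seen=7, k=3):
        # P(at least k lands among `seen` cards from a shuffled deck).
        return sum(comb(lands, i) * comb(deck - lands, seen - i)
                   for i in range(k, min(lands, seen) + 1)) / comb(deck, seen)

    # Chance of making your third land drop on time, on the draw
    # (turn 3 = 7-card hand + 3 draws = 10 cards seen):
    print(round(p_at_least_k_lands(seen=10, k=3), 3))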


I like analyzing any given permanent's chance to be cast, that's neat. Did you model each card's power / importance at all, or were they all treated as equal?


The player AI driving the game would score a permanent based on things like power, toughness, and CMC, with varying weights. And it would use that score to decide which permanent to play when it could cast more than one. A pretty simple model, but probably an effective starting point.

Actually, it was a bit more abstract: it scored a game state where it considered this player's permanents, the opponent's permanents, this player's life total, and the opponent's life total(s).


There's already an Elo project for sanctioned events at the Grand Prix level and above: http://www.mtgeloproject.net/

It uses the public pairings and results that were published each round for events all the way back to the 90s. Unfortunately, there are fewer competitive MTG events these days, so most people's ratings stop in early 2020, but that's another topic altogether.


The TrueSkill variant of the Elo algorithm has a publicly available Python module.

My understanding is that it handles new players and teams better than straight Elo.

My use for it was to keep track of winners during a Mario Kart tournament and see if it could predict the winners.

It did OK.

https://trueskill.org/
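
For reference, the module's basic API is small; here's a sketch of a four-player free-for-all race (the names and finishing ranks are illustrative):

    from trueskill import Rating, rate

    # One-player "teams"; ranks are finishing places, lower is better.
    players = ["ann", "bob", "cid", "dee"]
    ratings = [(Rating(),) for _ in players]
    new_ratings = rate(ratings, ranks=[0, 1, 1, 3])  # bob and cid tied

    for name, (r,) in zip(players, new_ratings):
        # mu is the skill estimate, sigma the remaining uncertainty;
        # sigma shrinks as a player logs more races.
        print(name, round(r.mu, 2), round(r.sigma, 2))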


I'm assuming your Elo-calculation code is looking at one match at a time. If you want to level-up the mathematical precision, check out Bayeselo from Remi Coulom: https://www.remi-coulom.fr/Bayesian-Elo/


> If you want to level-up the mathematical precision

I absolutely do! Thanks for the link, I’ll definitely check it out.


It is an implementation of the Bradley-Terry model. There are a few other implementations mentioned elsewhere in the comments.


This project should consider using data gathered at 17lands to check how well it scores.


Anyone else read the title as ELO scoring (as in, Electric Light Orchestra is scoring)?


Came to say that MTG Arena is pretty cool for an old MTG player like me.



