Calculating Jam Voting Results

The complexities and impossibilities of pleasing everyone.


This is a long and tedious post, but I ask you to please read all of it before commenting. There are many things discussed in the post and your question/concern may be answered by a single sentence somewhere in it.

Top-Rated Games Shouldn’t Be Upvoted or Downvoted By Cheating

I made multiple accounts to uprate my game, and got every member of my family to rate it a 5. Why isn’t my game in the top ten?

or

I made another account to rate every game a 1 so I would win! Why didn’t I?

Uhhhhh….. o.o

We invalidated votes from IP addresses that were detected to be cheating by upvoting their own entries or downvoting other entries. I don’t want to go into too much detail about this because the more transparent the cheat-detection system is, the easier it is to get past it. We detected many cheaters (around 2,500 votes).

We define uprating as one person making multiple accounts to rank a game highly and artificially inflate the score.

Downrating is defined as one person giving very low ratings to many games in order to lower the scores of any entries above them in the rankings.

Someone actually hacked into other peoples’ Facebook accounts to sign up and make it appear as if different people were rating the same game… from the same IP address. On Facebook, they were all in different locations, from different colleges, and were mostly graduates, which makes having the same IP address very unlikely.

There was also someone that gave their own game a massive amount of 4 ratings so as not to arouse suspicion.

And, of course, there were the many typical uprating and downrating schemes that we deal with constantly on the normal Game Jolt site.

Hopefully this sheds some light on the kind of craziness we have to think about and detect. There may be some cases where you got caught by the system simply because you had all your family members rate your game a 5. While this isn’t the same exact kind of problem as straight-up cheating, it’s not exactly fair to the rest of the people in the jam.

We didn’t disqualify games that were found to be uprated, we just disqualified those votes. If you saw the vote count on your games go down, it’s because those ratings were from cheaters. This is probably a good thing since if you weren’t the one cheating, it was most likely downraters trying to crush your game.

Top-Rated Games Shouldn’t Just Be the Highest Average Score

Why even do weighting? Just use the freakin’ average!

This jam actually highlighted well why we don’t just use the unweighted average. There are currently 29 jam games, all rated an average of 5, with very low vote counts that would be ranked higher than you. If we didn’t invalidate the cheating upvotes, it would have been even worse.

A threshold was suggested to weed this out, but that can quickly become arbitrary. For example, many of the games with an average rating of 5 have only a couple of ratings, but there is a game in there with nine 5 ratings. How do we come up with this magical threshold and will it hold true for all jams, big and small? What if it was a small “neighborhood” jam with only ten games and little media attention? Thresholds can be arbitrary, but straight averages don’t solve the other issues that we’ll discuss.

In all honesty, showing the straight average on game pages was probably not a good idea. I did that for visibility into your average in case you cared how your straight average rating stacked up against others without the adjustments, but I think it has actually caused more confusion and frustration than positive input.

How the Weighting Works

The base system used for weighting is a simplified Bayesian method. Bayesian probability can be defined thusly:

Bayesian probability belongs to the category of evidential probabilities; to evaluate the probability of a hypothesis, the Bayesian probability specifies some prior probability, which is then updated in the light of new, relevant data (evidence).

This matches very closely with what we desire in a voting system. Namely, that we have a small set of data, but would like to make an informed decision about the probability that a game should be given a certain rating. Think about the case of a straight average, where a game with one 5 rating would get an average of 5. A game with three 5s and one 4 would get an average of 4.75. That would rank it below the game with just one vote.

The reason is that we don’t have enough data to assume the game with an average of 5 really does deserve a 5. It might, but we only have one vote, so it’s unlikely. The probability is currently low. As the game amasses more votes, though, the probability that the game deserves the average that it’s been given increases.

So, we start every game off at the jam-wide average. As the game gets more votes, our confidence in its own straight average grows, so we pull the weighted average toward it. 1s pull the average down, and 5s pull it up.

This is why you will see a lot of games without many votes somewhere in the middle of the result set. Their averages may be much lower than your game’s average, but we don’t have enough data to make an informed decision as to whether they are actually “worse” than yours. For example, if the current straight average across all games in the jam is 4, and your game has ten ratings with a straight average of 3.5, then a game with an average of 2 from just one vote will actually rank higher than your game, because we don’t have enough confidence that it really deserves a 2.

This system isn’t perfect, of course, but it helps in many cases such as with games that have low vote counts that get rated really high or really low.

It does suck in a way since it produces a lot of games that end up being in the middle of the listings because they didn’t have enough votes to move very far away from the average score. If your game is in the lowest 100, it’s not because your game is terrible, it may just be because you were rated lower than the average with a lot of votes cast on your game. And the average is pretty high since people tend to mostly vote on the games they like, rather than also voting lower on the games they don’t like. The average rating across all games for this jam is 4.1863. I’m not fully sure how to fix that problem at this time, so any input on that would be appreciated!

TL;DR: All games start at the straight average for the jam. As your game gets more votes, it’ll pull your weighted rating closer to the straight average because we have more confidence in that straight average based on the number of votes cast.
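To make the idea above concrete, here is a minimal sketch of a simplified Bayesian (damped) average. The `damping` constant is purely illustrative, not Game Jolt’s actual value; it represents how many votes it takes for a game’s own ratings to meaningfully pull it away from the jam-wide mean.

```python
def weighted_rating(votes, jam_mean, damping=10):
    """Simplified Bayesian average: every game starts at the jam-wide
    mean, and its own votes pull the result toward its straight average
    as they accumulate.

    votes    -- list of individual ratings (1-5) for this game
    jam_mean -- straight average rating across all games in the jam
    damping  -- illustrative prior weight (acts like `damping` phantom
                votes at the jam mean); NOT the real site constant
    """
    n = len(votes)
    if n == 0:
        return jam_mean
    return (damping * jam_mean + sum(votes)) / (damping + n)


# With a jam mean of 4.0: one lone 5 barely moves off the mean,
# while three 5s and a 4 move further and rank higher.
one_vote = weighted_rating([5], 4.0)            # ~4.09
four_votes = weighted_rating([5, 5, 5, 4], 4.0) # ~4.21
```

This reproduces the example from the post: under a straight average the single-vote game (5.0) would beat the four-vote game (4.75), but the damped average ranks the four-vote game higher because there is more evidence behind its score.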

Being Top-Rated Shouldn’t Be a Popularity Contest

You have 1,000 ratings, your average is 5, and you got there because you’re internet-famous. You should be even MOAR FAMOUS!

We implemented user-weighting to the results, which is probably the biggest cause of discrepancy between straight average and weighted average. The idea is that if you only played one game in the jam, how do you know it should really be a 5? What do you have to compare it against? Users that have played more games in the jam are more likely to know each game’s actual relative score.

So if someone gave more ratings, each of their votes was given more “weight” (up to a point).

This also helped balance out the cases where famous/popular internet people were able to rally their fanbases and crush the system with 5 votes. That feels pretty good when you’re someone who’s already made it, but it feels really bad when you’re the smaller developer who made a great game that’ll never get the spotlight because you’re relatively unknown. This can also happen if you were just lucky enough to get a big YouTuber or some social network to go crazy for your game during the voting period.

Getting a top-rated game shouldn’t be a popularity contest.

If you’re still upset that your game which got a ton of votes because you’re a Big Deal didn’t place highly, it’s most likely because you didn’t impress the people that actually went through and played more than one game in the jam.

How Did User-Weighting Work?

Well, we basically assigned more weight to the users that voted on more games. Mathematically, we multiplied your vote by your user weighting score. The effect is as if you had voted on each game more than once with the score you gave, so the game’s weighted average was pulled more strongly toward your rating.

It was a logarithmic curve, meaning that your weighting increase per vote softened as you voted more and more. This way, the users that voted on tons of games didn’t have weights that were exponentially or even linearly more important.
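A minimal sketch of what a logarithmic user weight and a user-weighted mean could look like. The base weight, cap, and curve here are assumptions for illustration only; the post doesn’t reveal the actual constants (intentionally, to make gaming the system harder).

```python
import math


def user_weight(votes_cast, cap=5.0):
    """Illustrative logarithmic weight: more votes cast means more
    weight, with diminishing returns per additional vote, up to a cap.
    The 1 + log(n) shape and the cap are hypothetical, not Game Jolt's
    actual formula.
    """
    return min(cap, 1.0 + math.log(votes_cast))


def user_weighted_mean(ratings):
    """ratings: list of (rating, votes_cast_by_that_user) pairs.
    Each rating counts user_weight(votes_cast) times, as if that
    user had voted multiple times with the same score.
    """
    total = sum(r * user_weight(n) for r, n in ratings)
    weight = sum(user_weight(n) for _, n in ratings)
    return total / weight
```

For example, two drive-by 5s from users who rated only one game, against a 2 from a user who rated 30 games, lands well below the straight average of 4: the experienced voter’s rating dominates, which is exactly the anti-fanbase effect described above.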

This helped eliminate ratings by users who only came to vote on your game because they love you, rather than making an informed vote.

If you’re upset that you didn’t know all of this earlier, because you would have told your audience to vote on multiple games as well so that their votes would count for your game—well, that’s actually why we didn’t tell you sooner. It’s a gray area that feels very much like cheating or vote stuffing.

Most sites that let communities rate things (such as Reddit) have some sort of process for weeding out ratings from people who were obviously told to just uprate a certain post.

The average number of votes per user was ~7. There were 30 users that voted on more than 30 games. And 6 users cast over 100 votes. The two top-weighted users are Jupiter Hadley with 304 votes cast and Selzier with 262 votes cast. 8,759 total users voted on 768 games.

Participant-Only Jam Voting

We’ve discussed this in the past, and it may be something that we just won’t be able to agree upon.

Some people want the jam voting to be restricted to the people that participated. This is valid for many jams! It’s certainly not a bad thing.

We decided against it because this jam had a very specific theme: games that are fun to play and fun to watch. It was a jam very much focused on gamers rather than developers. Because of this, we let everyone vote. Yes, that’s right, we actually gave gamers a voice. Scary, yeah? This isn’t much different from the real world, where gamers at large vote with their wallets. If anything, I believe it’s a good skill to have and a good thing to think through. We made sure to be up-front about this so that everyone knew what they were getting into.

If you still have concerns about letting every gamer, Youtuber, and developer vote, first see if your concern is addressed above with one of the systems outlined (such as user-weighting).

Game Sorting

There were some concerns about the default sorting method for jam games during the voting process. We opted to randomly sort the list of around 800 games for each user that came to the page.

We could have tried sorting games with fewer votes at the top, or games with lower ratings at the top. We decided against it since most people tend to skip over games they don’t like without voting and just vote on entries they do like. This would result in the potentially “crappier” games being at the top.

We could have sorted by developers that have rated more games in the jam than other devs. We decided against this since it’s not a participant-only voting jam. The voting period was also pretty short and developers had just busted their asses working on their games for three days straight with little sleep.

No Extension to the Voting Period

We decided not to extend the voting period even though there was a massive turnout because it would step on Ludum Dare. We chose the jam dates carefully before the jam was announced to try to wedge it between all the other high profile jams going on. From our experience in the past, jams like LD are so high-profile that they whisk away all the excitement and people’s time to vote.

Also, while there were a lot of games in the jam—way more than can be humanly played in seven days—looking at actual votes, it appears most people skimmed the games list and played games based solely on how good the game page looked.

And with the randomized sorting of games, everyone had a fair chance to get skimmed over by the gamers looking for some games to rate.

Removed Games

You may have seen the total count of games shifting during the voting period. There are a few reasons for this.

We went through and removed most games that we found to be cheating either by having started their games early, or by using assets that they didn’t create. We removed around 60 games that cheated in one of these ways.

We also hid games that never uploaded a build. This resulted in about 100 games being hidden from the jam.

I’m Sorry You Didn’t Win

There will still be people that are upset. People that think that their game deserved to be placed in the top 10 out of 800 games. There was a 1.25% chance of that happening with so many really great entries to compete against.

You most likely have a difference of opinion as to the top 10 than the current top 10, or even top 100. I do too! Just like you, I can’t change the results to match my exact desires. All results are based on user input. In the past, it was the Game Jolt staff that chose the top winners based on our own criteria, but that quickly became impossible when we started getting over 100 entries in a jam. Rating systems can’t be perfect, and this one certainly isn’t, although I don’t think it’s bad, and in fact solves many issues quite elegantly, and I hope you understand a bit more about why it works the way it does.

If you’re upset with the games that ended up top-rated but didn’t vote on any entries yourself, that lack of participation is probably part of why you’re upset.

If you have better mathematical formulas for coming up with voting results, be sure to add them in the comments! We can make this system better together, I think, for future jams.

If you’re still upset, I hope that you try to see the good that came out of the jam and hope that for the most part people had a fun experience. Game Jolt is about doing things together. Let’s work on making things better through constructive input and positivity.

As more questions come in through the comments, I’ll try to update this post to answer them.
