Have you recently seen a new game on Steam and tried guessing how successful it’s going to be? Maybe you’re developing your own game or just thinking about making one and are wondering the same. What if we looked at existing games and let a computer learn from them? It could then analyze your concept and give you an estimate of sales. And you could make changes to maximize the sales.

I decided to make this the subject of my research. Here’s what I learned.

Data

The research was conducted on a dataset of nearly 10,000 games released on Steam up to July 2016. I eventually narrowed the data to games released since September 2013, when Steam Greenlight changed its policy to allow more games onto the platform.

Free-to-play and Early Access titles were excluded, as they make money differently from traditional premium games. After all filtering, over 4,200 games remained in the dataset. The information about each game was mostly downloaded from the Steam website. I had two candidate measures of success: owners from Steam Spy and concurrent players from Steam Charts. Unfortunately, historical data from Steam Spy isn’t available, so I used Steam Charts to calculate the average number of concurrent players in the first two months after release, a measure that is as useful as owner counts and is readily available.
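To make the success measure concrete, here is a minimal sketch of how such an average could be computed. It is not the code used in the research: the function name, the 61-day window standing in for "two months", and the sample numbers are all assumptions made for illustration.

```python
from datetime import date, timedelta


def average_players_after_release(samples, release_date, window_days=61):
    """Mean concurrent players over the first `window_days` days after release.

    `samples` is a list of (day, concurrent_players) pairs. Returns None
    when no sample falls inside the window. The 61-day window is an
    assumption standing in for "the first two months".
    """
    window_end = release_date + timedelta(days=window_days)
    in_window = [players for day, players in samples
                 if release_date <= day < window_end]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


# Made-up samples for a hypothetical game released on 2016-01-10.
samples = [
    (date(2016, 1, 5), 0),     # before release, ignored
    (date(2016, 1, 10), 120),
    (date(2016, 2, 1), 60),
    (date(2016, 3, 1), 30),
    (date(2016, 4, 1), 10),    # outside the window, ignored
]
average = average_players_after_release(samples, date(2016, 1, 10))
print(average)
```

Only the three samples inside the window contribute to the result, so the launch spike and the early drop-off are both reflected in a single number.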

The key attributes describing each game are:

  • Name
  • Developer
  • Publisher
  • Age Requirements
  • Release Date
  • Price
  • Description
  • Platforms
  • Game/Steam Features
  • HW Requirements
  • Languages
  • Genres
  • Thumbnail
  • Screenshots
  • User Tags
  • Concurrent Players (steamcharts.com)

Stats

Let’s look at the premium games for which I had data from Steam Charts (August 2012-July 2016, roughly 4,600 games). Over those years, around 30 percent of games averaged fewer than one concurrent player after release, while only a few averaged over 10,000, as the graph below shows.

[Graph: Average number of concurrent players in games (released August 2012-July 2016)]

Since Greenlight, we’ve been seeing more games that virtually no one plays. As you can see below, 2013 was still a manageable year for Valve, but then game makers flooded Steam with hundreds of games. Unfortunately, I don’t have data covering the whole of 2016, but it would likely show a similar trend: a growing number of games that no one ever plays.

I looked at games released in 2015 and compared the values of some attributes with how successful the games were. Multiplayer, Trading Cards, and Achievements turned out to be far more common in the more successful games. Games with higher hardware requirements and higher prices generally sell better too, most likely because both go hand in hand with a higher budget, which helps a lot.

Steam genres are way too generic and don’t show any significant differences. User tags, on the other hand, are far more interesting. Those that stand out as unsuccessful include: 2D, pixel graphics, platformer, point-and-click, puzzle, retro. Games with these tags generally do pretty poorly.

Games with the following tags have been mostly successful: third-person (shooter), first-person (shooter), open world, sandbox, survival, story-rich, fantasy, sci-fi, zombies.

Predictions

The dataset went through a process of adjusting existing attributes and deriving new ones, for a total of over 200 attributes. I then used machine learning methods to predict the average number of concurrent players. I tried both regression (predicting the exact number) and classification (dividing the games into two groups). SVM and Random Forest generally gave the best results.
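As a rough illustration of what deriving attributes can look like, the sketch below flattens one raw game record into numeric columns. The record layout, the field names, and the choice of derived columns are hypothetical. They are loosely inspired by the attributes listed earlier and are not the actual 200-attribute pipeline.

```python
def derive_attributes(game, known_tags):
    """Flatten one raw game record (a dict) into numeric attributes.

    The field names used here are hypothetical, chosen for this example.
    """
    attributes = {
        "price": float(game["price"]),
        "description_length": len(game["description"]),
        "language_count": len(game["languages"]),
        "has_multiplayer": int("Multi-player" in game["features"]),
    }
    # One 0/1 column per user tag we care about.
    for tag in known_tags:
        attributes["tag_" + tag] = int(tag in game["tags"])
    return attributes


# A made-up record for a game that does not exist.
known_tags = ["Open World", "Sandbox", "2D"]
game = {
    "price": "19.99",
    "description": "Explore a vast island and build your own base.",
    "languages": ["English", "German", "Polish"],
    "features": ["Single-player", "Multi-player", "Steam Achievements"],
    "tags": ["Open World", "Survival", "Sandbox"],
}
attributes = derive_attributes(game, known_tags)
print(attributes)
```

Every value ends up as a number, which is the form most regression and classification methods expect.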

Regression gave me a correlation of 0.7 (0 means no correlation and 1 means a perfect fit) and a root relative squared error, or RRSE, of 72 percent (100 percent is no better than always predicting the average; lower is better). For classification, I tried detecting games with more than 10 players on average (to see if the “better” ones can be separated). I detected only 39 percent of them, and 19 percent of the games flagged as “better” were flagged wrongly (formally: 81 percent precision and 39 percent recall; the overall accuracy was 86 percent against a 79 percent baseline).
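For readers unfamiliar with these measures, here is a small sketch of how they are commonly defined. The function names and the toy numbers are made up for illustration, and the baseline is assumed to be the accuracy of always guessing the larger group. None of the figures reported in this article come from this code.

```python
import math


def pearson_correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def rrse(actual, predicted):
    """Root relative squared error: 1.0 equals always predicting the mean."""
    mean_a = sum(actual) / len(actual)
    squared_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    mean_error = sum((a - mean_a) ** 2 for a in actual)
    return math.sqrt(squared_error / mean_error)


def classification_report(actual, predicted):
    """Precision, recall, accuracy and majority-class baseline for booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    total = len(pairs)
    positives = tp + fn
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / total,
        # Assumed baseline: accuracy of always guessing the larger group.
        "baseline": max(positives, total - positives) / total,
    }


# Toy average-player numbers, not taken from the dataset.
actual_players = [0.5, 3, 12, 40, 8, 150, 2, 11]
predicted_players = [1, 2, 15, 9, 12, 90, 1, 4]

threshold = 10  # "more than 10 players on average"
report = classification_report(
    [a > threshold for a in actual_players],
    [p > threshold for p in predicted_players],
)
print(pearson_correlation(actual_players, predicted_players))
print(rrse(actual_players, predicted_players))
print(report)
```

Comparing accuracy against the baseline matters because most games fall into the "10 players or fewer" group, so a classifier that never flags anything already looks accurate.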

These aren’t exactly amazing results, although they show that there definitely is a correlation between a game’s metadata and its success. So I decided to look for a subset of games on which the predictions would work better.

Such a subset turned out to be games from developers or publishers with at least two games already released. When evaluating an upcoming game from such a developer or publisher, I know how their previous games did, which on its own isn’t enough to make an accurate prediction but helps significantly.

With this criterion, I can cover around a third of Steam games. For regression, this means an improvement to 0.82 correlation and 58 percent RRSE. Classification gave me 79 percent precision and 55 percent recall, meaning the algorithm was slightly more often wrong about the games it flagged as having more than 10 players on average but missed fewer of them (overall accuracy 83 percent against a 72 percent baseline). I definitely liked regression more, as dividing games into two groups leaves a lot of games sitting close to the boundary on both sides.
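The track record of a developer or publisher can be summarized in a handful of numbers. The sketch below is one possible way to do it, assuming that "Gini index" in the feature list that follows means the Gini coefficient of inequality. The function names and the sample values are hypothetical.

```python
def gini_coefficient(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Weight each value by its 1-based rank in ascending order.
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def track_record_features(previous_averages, min_games=2):
    """Summarize average players of earlier games by one developer/publisher.

    Returns None when there are fewer than `min_games` earlier games,
    mirroring the "at least two games already released" criterion.
    """
    if len(previous_averages) < min_games:
        return None
    return {
        "min_players": min(previous_averages),
        "max_players": max(previous_averages),
        "gini": gini_coefficient(previous_averages),
    }


# Made-up averages for three earlier games of a hypothetical studio.
features = track_record_features([2.0, 40.0, 8.0])
print(features)
```

A Gini coefficient near zero would describe a studio whose games all perform alike, while a high value points to one hit among several quiet releases.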

The following features turned out to have the largest impact on the predictions (careful: presence or higher value doesn’t necessarily mean higher prediction):

  • Minimum and maximum of average players across previous games by the same developer/publisher, and Gini index of these numbers
  • GPU and storage requirements
  • Tags: Open World, Third-Person, Sandbox, Story-Rich
  • Support for Spanish, French, Polish, Italian, Russian, German, Portuguese-Brazil and the total number of supported languages
  • Genres: Indie and Casual
  • Presence of DRM in any form and presence of EULA
  • Length of description
  • Launch price
  • Age requirements
  • Whether the game is a sequel
  • Average saturation and number of distinct colors of the first displayed screenshot
  • Presence of multi-player
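The screenshot features in the list above are probably the least obvious ones, so here is a minimal sketch of how they might be computed. It assumes the screenshot has already been decoded into a list of (red, green, blue) tuples with values from 0 to 255, and that saturation is taken from the HSV color model. Both are assumptions, as is the function name.

```python
import colorsys


def screenshot_color_features(pixels):
    """Average HSV saturation and number of distinct colors.

    `pixels` is a non-empty list of (red, green, blue) tuples, 0-255 each.
    """
    saturations = [
        colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1]
        for r, g, b in pixels
    ]
    return {
        "average_saturation": sum(saturations) / len(saturations),
        "distinct_colors": len(set(pixels)),
    }


# A made-up four-pixel "screenshot": two red pixels, one gray, one white.
pixels = [(255, 0, 0), (255, 0, 0), (128, 128, 128), (255, 255, 255)]
color_features = screenshot_color_features(pixels)
print(color_features)
```

Pure red is fully saturated while gray and white have no saturation at all, so a colorful screenshot scores higher than a washed-out one.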

There is an application for predictions available, but I wouldn’t recommend using it for any serious business decisions because I had to keep the predicted intervals pretty wide (and it can still be inaccurate). Think of it more as a toy showing how, for example, adding some features increases the predicted number of players. (Also, I wasn’t doing this research to create an end-user application, so apologies if it crashes on some inputs.)

Conclusion

While it’s not possible to make exact predictions about a game’s success knowing only basic descriptive information about it, there is a strong correlation. If I were to give some general advice according to my research, it would be the following:

If you haven’t released anything successful on Steam yet, you’re gonna have a hard time. It doesn’t mean you have no chance, but you’ll just need to put a lot of effort into your game and inform the right people about it. And maybe have a bit of luck.

Gamers generally buy high-budget games despite their higher price. Low-budget titles don’t require nearly as many sales to fund their development but since there’s an excessive number of them, it’s very hard to stand out with a 2D platformer, for instance. Hence, it could be worth investing more time and money into the development process. This is supported by the fact that third-person, open-world, and story-focused games generally attract a lot of players. Also, including multiplayer helps a lot.

Michal Trněný is a fresh Master of Computer Science with a passion for data and gaming.
