The video game industry has seen numerous commercial hits and catastrophic failures, setting precedents for which attributes can contribute to a game's success. PlayOn has compiled a list of data points from Steam, Twitch, and Twitter, such as the number of concurrent viewers/streamers of a specific game, the number of members in the game's official Discord server, and the number of followers of the game's Twitter page. This data is scraped from a game's Steam page on a daily basis, compiled in PlayOn's internal database, and amounts to around 50 million individual data points.
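For illustration, a single day's record for one game might look like the following sketch; the column names are assumptions for this example, since PlayOn's actual database schema is not shown here.

```python
import pandas as pd

# Hypothetical example of one day's scraped record for a single game;
# the column names are assumptions, not PlayOn's actual schema.
sample = pd.DataFrame([{
    "game": "Example Game",
    "date": "2021-10-01",
    "twitch_concurrent_viewers": 15234,
    "twitch_concurrent_streamers": 412,
    "discord_server_members": 88000,
    "twitter_followers": 125000,
}])
print(sample)
```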
The main question to be addressed is which specific data points contribute the most to predicting whether a game will become successful. The business case is narrowing down the vague idea of what makes a game “successful” so that PlayOn can communicate to their clients any outstanding patterns or similar traits that successful games share. This question was approached with the following plan: develop a solid understanding of what the data consists of, clean and group the data, feed the data into our assigned models, and continuously adjust until the models produce usable results.
The models we researched were Brain.js and TensorFlow models with densely connected layers, and a text-classification-focused Brain.js network. The results from these models will be used to give PlayOn a hypothetical understanding of which aspects of the data contribute to a game's success.
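As a sketch of the densely connected architecture, shown here with TensorFlow's Keras API in Python (the Brain.js variant is analogous in JavaScript), the layer sizes, activations, and feature count below are illustrative assumptions, not the project's actual hyperparameters.

```python
import tensorflow as tf

n_features = 6  # placeholder: number of input variables after cleaning

# Illustrative densely connected network for a binary "successful or not"
# prediction; layer sizes and activations are assumptions, not the
# project's actual hyperparameters.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of "success"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```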
This project involves two steps from data to results: first, the data needed to be cleaned and put into a usable format; it was then passed to a neural network for training and evaluation.
The process for cleaning the data runs in Jupyter notebooks using the pandas library and has three core steps:
Neural networks are not able to work around missing data, so in order to still train on incomplete variables, we needed to fill in the missing values with token values, giving the network an input to work with. Given the volume of data we are working with, and how much of it is essentially junk input, this is not a huge leap for the inputs to make. This was done with mean, mode, and zero fills, and the results were compared.
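A minimal sketch of the three fill strategies with pandas follows; the file name and column name are hypothetical.

```python
import pandas as pd

df = pd.read_csv("games.csv")          # hypothetical file name
col = "twitch_concurrent_viewers"      # hypothetical column name

# Three alternative token-value fills for a numeric column with gaps;
# each variant was trained on separately and the results compared.
zero_fill = df[col].fillna(0)
mean_fill = df[col].fillna(df[col].mean())
mode_fill = df[col].fillna(df[col].mode().iloc[0])
```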
At this point the data is in a .csv file and can be passed to the respective neural networks. Within each network, different sets of input variables could be trained on and compared, with the losses tracked and recorded in Excel files for comparison. Additionally, the alternate data fills (versus the default zero fill) had their results tracked as well. Results were mostly tracked through their loss values, as loss gives the best overview of total error in the training process and sidesteps more 'common sense' performance metrics like accuracy.
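A sketch of this training-and-tracking stage, assuming the cleaned .csv, a hypothetical binary 'successful' label column, and TensorFlow's Keras API (the Brain.js runs are analogous in JavaScript); file names, column names, and hyperparameters here are illustrative.

```python
import pandas as pd
import tensorflow as tf

df = pd.read_csv("games_cleaned.csv")       # hypothetical cleaned data
features = ["twitch_concurrent_viewers",    # hypothetical subset of the
            "discord_server_members",       # variables being compared
            "twitter_followers"]
X = df[features].to_numpy()
y = df["successful"].to_numpy()             # hypothetical binary label

model = tf.keras.Sequential([
    tf.keras.Input(shape=(len(features),)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

history = model.fit(X, y, epochs=20, validation_split=0.2, verbose=0)

# Export per-epoch losses so runs with different variable subsets or
# fill strategies can be lined up side by side in a spreadsheet.
pd.DataFrame(history.history).to_csv("losses_zero_fill.csv", index=False)
```

Each run with a different variable subset or fill strategy writes its own loss file, which is what gets compared across configurations.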