Here is the link to the website.
The links to the data used in this project are located in the "Data Collection" section of the project.
Kai Yu, working by himself.
This project investigates the relationship between the price of ETH, the virtual coin, and the video card information in the Steam user survey, with the goal of building a predictive model and examining how well it works.
This topic might seem far-fetched at first, but from my knowledge of economics I am confident there is some relation between the video cards people are using and the price of ETH. ETH is the most popular virtual coin that requires video cards to mine, and since the number of video cards manufacturers can produce is relatively fixed, miners end up eating into the supply of cards that would otherwise be purchased by gamers.
There are two reasons we chose 5 years as the time restriction. The first is that ETH is not as old as Bitcoin: it is a relatively new coin that was proposed in 2013 and reached a price of about $1 in 2015. I believe the relationship will be more obvious once ETH forms the majority of the mining-video-card market. The second is that video cards evolve and phase themselves out as new generations of cards come out. The 5-year restriction lets us focus on certain influential video cards and allows an easier comparison.
# This block of code imports all the libraries we need.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
The main dataset we are using is the Steam Hardware Survey, a survey conducted by Steam every month to collect data about what kinds of computer hardware and software its customers are using. In this project, we mainly focus on the different video cards people are using. The raw data contains a list of different video cards, each followed by a percentage representing the share of Steam users using that type of video card. Steam doesn't supply the historical data in a convenient form, but luckily there is a selfless contributor who updates the data into a GitHub repository every month. Let us thank him: jdegene and his repository.
The second dataset we are using is the daily price of the ETH coin from CoinGecko. To match the monthly data from Steam, we use the price from the first day of every month.
We load the two CSV data files included in this dataset folder, obtained from the sources referred to in the "Project Dataset" section above.
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Colab Notebooks/project
eth_df = pd.read_csv('eth.csv')
steam_survey_df = pd.read_csv('shs.csv')
eth_df.head()
We clearly do not need all of the columns in this dataframe of prices, so we tidy it in the next block.
# Tidy eth_df: extract year, month, and day, then keep only the first day of each month
date_parts = eth_df.snapped_at.str.split("-", expand=True)
eth_df["Year"] = date_parts[0]
eth_df["Month"] = date_parts[1]
eth_df["Day"] = date_parts[2].str.split(" ", expand=True)[0]
eth_monthly_df = eth_df[eth_df["Day"] == "01"].copy()
eth_monthly_df["Time"] = eth_monthly_df["Year"] + "-" + eth_monthly_df["Month"]
eth_monthly_df = eth_monthly_df.set_index("Time")
eth_monthly_df = eth_monthly_df[["price"]]
eth_monthly_df
Now this dataframe is tidy enough to do some visualization. To be fair, I never realized that the ups and downs in the price of ETH were this huge until I saw this graph.
# Plot the ETH price over time (the Time index is used as the x-axis)
eth_monthly_df["price"].plot.line()
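To put a rough number on those swings (a small addition, not part of the original analysis), we can also look at the month-over-month percentage change of the price:
# Month-over-month percentage change of the ETH price, a rough measure of its volatility
eth_monthly_df["price"].pct_change().describe()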
Our next task is the dataframe for the Steam survey.
steam_survey_df.head()
Wow, wow. It seems that there is too much irrelevant data in this dataframe, as I believe that "Free Hard Drive Space" is not a great indicator of ETH prices. Let's start cleaning it up.
# Cleaning up data from shs.csv, as we only need data about GPUs for a certain time period.
# But first, let's do some necessary work to make assigning the index easier later.
steam_survey_df["Year"] = steam_survey_df.date.str.split("-", expand=True)[0]
steam_survey_df["Month"] = steam_survey_df.date.str.split("-", expand=True)[1]
steam_survey_df["Year"] = pd.to_numeric(steam_survey_df["Year"], errors="coerce")
steam_survey_df["Month"] = pd.to_numeric(steam_survey_df["Month"], errors="coerce")
steam_survey1_df = steam_survey_df[((steam_survey_df.Year > 2017) | ((steam_survey_df.Year == 2017) & (steam_survey_df.Month > 9))) & ((steam_survey_df.Year != 2022) | (steam_survey_df.Month != 10))]
# Filter using steam_survey1_df's own mask, and take a copy so later column assignments are safe
GPU_df = steam_survey1_df[steam_survey1_df.category == "Video Card Description"].copy()
GPU_df.name.value_counts()
GPU_df
It is worth noting that the original plan was to include the data from September 2017. However, it turns out that there is some data loss for that month, as shown below.
# In the original plan we were supposed to collect 5 full years of data. However, after looking into the raw data
# I found that there is some data loss for September 2017. Therefore our analysis can only start at 2017-10.
steam_20179_survey_df = steam_survey_df[(steam_survey_df.Year == 2017) & (steam_survey_df.Month == 9)]
GPU_20179_df = steam_20179_survey_df[steam_20179_survey_df.category == "Video Card Description"]
GPU_20179_df
# As you can see, there are only 4 lines of data for that month
The next part is the most time-consuming part of the project. We know that a number of GPUs are not suitable for mining, since mining has a hard requirement on the amount of VRAM a GPU has. Therefore we have to do a manual selection to filter out the GPUs that are "irrelevant" here. For a closer look, please follow the next block of code.
# Since mining has a hard VRAM requirement, we have to label each card as mining-capable or not
# First, let's pull out the manufacturer so we can treat the AMD and NVIDIA cards separately
GPU_df["manu"] = GPU_df.name.str.split(" ", expand=True)[0]
GPU_df["mining"] = 0
# We know that all NVIDIA GPUs with "RTX" in the name are capable of mining
GPU_df["type1"] = GPU_df.name.str.split(" ", expand=True)[2]
GPU_df.loc[GPU_df['type1'] == "RTX",["mining"]] = 1
# Next, we know that GTX cards from the 10 series onward are capable except the GTX 1050 and GTX 1050 Ti; here we mark the 1060, 1070, and 1080.
GPU_df["type2"] = GPU_df.name.str.split(" ", expand=True)[3]
# GPU_df.loc[(GPU_df['type1'] == "GTX" & (GPU_df['type2'] == "1080" | GPU_df['type2'] == "1070" | GPU_df['type2'] == "1060")) ,["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "GTX") & (GPU_df['type2'] == "1080"),["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "GTX") & (GPU_df['type2'] == "1070"),["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "GTX") & (GPU_df['type2'] == "1060"),["mining"]] = 1
# Next, let's work on the AMD cards. We know that most cards starting with RX are capable of mining, except a few low-end cards.
GPU_df.loc[(GPU_df['type1'] == "RX"), ["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "460"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "560"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "450"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "550"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "Vega"),["mining"]] = 0
# There is a problem with the AMD R9 200 series. We know that the 290 and 280 are capable of mining while the 250 and 260 are not,
# but for some reason they all show up the same in the survey, as "200 series". Sadly, this data will not count toward our analysis.
GPU_df.loc[(GPU_df['type1'] == "R9") & (GPU_df['type2'] == "390"),["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "R9") & (GPU_df['type2'] == "380"),["mining"]] = 1
GPU_df
Now we have marked all of the "minable" GPUs. Some columns were useful for that selection but are no longer needed, so let's drop them.
# Cleaning up the dataframe
gpu1_df = GPU_df.loc[GPU_df["mining"] == 1].copy()
gpu1_df["Time"] = gpu1_df.date.str.split("-", expand=True)[0] + "-" + gpu1_df.date.str.split("-", expand=True)[1]
gpu1_df = gpu1_df[["Time", "name", "change", "percentage"]]
gpu1_df
Nice! This is the piece of information we needed. The problem we are facing now is that, as one can tell from the dataframe, it is almost impossible to produce a meaningful graph if we plot the different GPUs one by one, because there are so many of them. (Trust me, no one wants to see that messy graph!) Therefore, to keep things straightforward and understandable, the next block of code produces a graph of how the total percentage of minable GPUs changes over time.
# Sum the numeric columns over all minable cards at each point in time and plot them
gpu3_df = gpu1_df.groupby("Time")[["change", "percentage"]].sum()
axes = gpu3_df.plot.line(subplots=True)
type(axes)
To be fair, the first version of the project focused on the sum of the minable GPU percentages. Although that idea was later abandoned, there is still something we can salvage from it, like the clean correlation table below, which shows that the "change" column is unreliable for prediction.
# Merge the summed GPU data with the monthly ETH prices (joining on Time)
gpu3_df = gpu3_df.merge(eth_monthly_df, on=["Time"])
# Reset the index twice so we get a plain integer "index" column to correlate against
gpu3_df = gpu3_df.reset_index()
gpu3_df = gpu3_df.reset_index()
gpu2_df = gpu3_df[["index", "percentage", "change", "price"]]
gpu2_df.corr()
We have to admit that both the graph and the table are clean and tidy. However, for the k-nearest-neighbors model we will use for prediction, we need the information about each video card to be kept separate for a better result. Therefore the dataframe will get a little bit messier...
gpu1_df = gpu1_df.groupby(["Time","name"])["percentage"].sum().unstack().fillna(0)
gpu1_df
This is how the entire dataframe looks. We are going to merge it with the price dataframe, which will allow us to build a model from it in the next section.
# Merge this data with the data from the ETH price.
# Some preparation so the two dataframes have a similar format
gpu1_df = gpu1_df.reset_index()   # move Time from the index back into a column
gpu1_df = gpu1_df.reset_index()   # add an integer "index" column recording the time order
# Actual merge
gpu2_df = gpu1_df.merge(eth_monthly_df, on=["Time"])
gpu2_df.head()
The goal of this project is to build a predictive model that predicts the price of ETH from the video cards people are using. After that, we can run some analysis and check whether our model works. If we reach a positive result, then we can conclude that such a model can be built and that there is a relationship between the price and the video cards people are using.
First, we should select the correct value of k to use. To do this, we run an error analysis to check where the lowest error appears.
# get the features (in dict format) and the labels
# (do not split into training and validation sets)
# Note: the feature columns are the integer "index", the "Time" string, and the percentage for each minable GPU
features = gpu1_df.columns.to_list()
X_dict = gpu2_df[features].to_dict(orient="records")
y = gpu2_df["price"]
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=10)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
# calculates estimate of test error based on 10-fold cross validation
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line()
test_errs.sort_values()
# According to this, we should select 2-nearest neighbor
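Rather than only reading the best k off the sorted list by eye, we can also ask pandas for it directly (a small sanity check added here, not part of the original code):
# The k with the lowest cross-validation MSE, and the corresponding MSE
test_errs.idxmin(), test_errs.min()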
Based on the result above, we now run a 2-fold validation to find the exact error when k = 2.
# next, we will build a 2-nearest neighbor model and check its cross-validation error
def get_val_error(X_train_dict, y_train, X_val_dict, y_val):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a 2-nearest neighbors model.
    model = KNeighborsRegressor(n_neighbors=2)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

# average the validation error over the two halves of a 50/50 split (2-fold cross-validation)
def get_cross(a, b, c, d):
    return (get_val_error(a, b, c, d) + get_val_error(c, d, a, b)) / 2
train = gpu2_df.sample(frac=.5)
val = gpu2_df.drop(train.index)
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["price"]
y_val = val["price"]
get_cross(X_train_dict, y_train, X_val_dict, y_val)
# seems acceptable
I would say that this error is acceptable. It is also a better result than the model built from the sum of all percentages, where the error was somewhere around 580.
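For completeness, here is a minimal sketch (my addition, not the original comparison code) of how that sum-of-percentages baseline could be re-evaluated with the same 2-fold procedure; the total_percentage column, the gpu_columns helper, and the random_state are assumptions of this sketch, so the number it produces will not exactly match the roughly 580 quoted above.
# Rebuild a one-feature version of the data: the total percentage of minable GPUs each month
sum_df = gpu2_df.copy()
gpu_columns = [c for c in gpu2_df.columns if c not in ("index", "Time", "price")]
sum_df["total_percentage"] = sum_df[gpu_columns].sum(axis=1)
# Evaluate a 2-nearest-neighbor model on this single feature with the same 2-fold procedure
train_s = sum_df.sample(frac=.5, random_state=0)
val_s = sum_df.drop(train_s.index)
get_cross(train_s[["total_percentage"]].to_dict(orient="records"), train_s["price"],
          val_s[["total_percentage"]].to_dict(orient="records"), val_s["price"])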
Finally, let's see how the model's predictions compare to the real prices in a graph.
# Let's see how our model works with a graph
features = gpu1_df.columns.to_list()
X_train_dict = gpu2_df[features].to_dict(orient="records")
y_train = gpu2_df["price"]
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
# Fit a 2-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(X_train_sc, y_train)
# Calculate the model predictions on the training data.
y_train_pred = model.predict(X_train_sc)
y_train_pred
# put the real and predicted prices side by side (avoid shadowing the built-in dict)
results = {'real_price': y_train, 'predicted_price': y_train_pred}
df_com = pd.DataFrame(results)
df_com
axes = df_com.plot.line(subplots=True)
type(axes)
With this graph, we believe we can say that we achieved our goal: the two curves are very similar, which suggests that our hypothesis holds and that there is a relationship between the video card data and the ETH price.
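As a quick numeric complement to that visual comparison (an addition, not part of the original analysis), we can also compute the RMSE of these predictions; note that this is an error on the training data itself, so it is an optimistic number.
# RMSE of the 2-nearest-neighbor predictions on the training data (optimistic, since these points were used for fitting)
np.sqrt(((y_train - y_train_pred) ** 2).mean())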
With the result in the model section, we successfully built a model and checked its accuracy. However, now that Ethereum has completed "the Merge", it can no longer be mined with video cards, so we remain pessimistic about how accurate this model will be in the future. We will call this "shattered dreams of wealth".
Thanks for reading!
%%shell
jupyter nbconvert --to html FinalProject.ipynb