Here is the link to the website.

The links to the data used in this project are located in the "Data Collection" section below.

The Relationship Between the Price of ETH and the Results of the Steam Hardware Survey

Kai Yu (working alone)

Intro

This project investigates the relationship between the price of ETH, the cryptocurrency, and the video card data in the Steam user survey, with the aim of building a predictive model and examining it.

Some explanation of the project

This topic might seem far-fetched at first, but from my knowledge of economics I am confident there is some relationship between the video cards people are using and the price of ETH. ETH is the most popular cryptocurrency mined with video cards, and since the number of video cards that manufacturers can produce is relatively fixed, miners are eating up the cards that would otherwise have been purchased by gamers.

There are two reasons we chose five years as the time window. First, ETH is not as old as Bitcoin: it is a relatively new coin, conceived in 2013, that was worth about $1 in 2015. I believe the relationship will be more obvious once ETH forms the majority of the mining side of the video card market. Second, video cards are constantly superseded as new generations come out. The five-year window lets us focus on a set of influential video cards and makes comparison easier.

In [24]:
# This cell imports all the libraries we need.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

Data Collection

The main dataset we are using is the Steam Hardware Survey, a monthly survey Steam runs to collect data about the computer hardware and software its customers are using. In this project we focus on the different video cards people are using. In the raw data, each video card is listed with a percentage representing the share of Steam users running that type of card. Steam does not publish the historical data in a convenient form, but luckily a selfless contributor, jdegene, updates the data monthly in a GitHub repository. Let us thank him and his repository.
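As a convenience, the survey CSV can in principle be read straight from that repository rather than from a local copy. This is only a sketch: the raw URL below is an assumption about the repository layout and should be checked against the actual repo before use.

import pandas as pd

# Assumed location of the monthly survey data in jdegene's repository; verify the path first.
RAW_SHS_URL = "https://raw.githubusercontent.com/jdegene/steamHWsurvey/master/shs.csv"
steam_survey_from_repo_df = pd.read_csv(RAW_SHS_URL)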

The second dataset we are using is the daily price of ETH from CoinGecko. To line up with the monthly Steam data, we use the price from the first day of every month.

ETL and EDA

We load the two CSV files included in the project's dataset folder; they come from the sources described above in the "Data Collection" section.

In [25]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Colab Notebooks/project
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks/project
In [26]:
eth_df = pd.read_csv('eth.csv')
steam_survey_df = pd.read_csv('shs.csv')
In [27]:
eth_df.head()
Out[27]:
snapped_at price market_cap total_volume
0 2015-08-07 00:00:00 UTC 2.831620 0.000000e+00 9.062200e+04
1 2015-08-08 00:00:00 UTC 1.330750 8.033948e+07 3.680700e+05
2 2015-08-10 00:00:00 UTC 0.687586 4.155631e+07 4.004641e+05
3 2015-08-11 00:00:00 UTC 1.067379 6.453901e+07 1.518998e+06
4 2015-08-12 00:00:00 UTC 1.256613 7.601326e+07 2.073893e+06

It is quite obvious that we do not need all of the columns in this price dataframe, so we will tidy it up in the next block.

In [28]:
# Remove unneeded data from eth_df to tidy the dataframe.

eth_df["Year"] = eth_df.snapped_at.str.split("-", expand=True)[0]
eth_df["Month"] = eth_df.snapped_at.str.split("-", expand=True)[1]
eth_df["Day"] = eth_df.snapped_at.str.split("-", expand=True)[2].str.split(" ", expand=True)[0]
eth_monthly_df = eth_df[eth_df['Day'] == "01"]
eth_monthly_df['Time'] = eth_monthly_df["Year"] + "-" + eth_monthly_df["Month"]
eth_monthly_df = eth_monthly_df.set_index('Time')
eth_monthly_df = eth_monthly_df[["price"]]
eth_monthly_df
<ipython-input-28-561e2e16105f>:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eth_monthly_df['Time'] = eth_monthly_df["Year"] + "-" + eth_monthly_df["Month"]
Out[28]:
price
Time
2015-09 1.350787
2015-10 0.685812
2015-11 1.058542
2015-12 0.875004
2016-01 0.946313
... ...
2022-06 1944.078767
2022-07 1057.630999
2022-08 1682.011374
2022-09 1552.562966
2022-10 1329.146032

86 rows × 1 columns
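Side note: the SettingWithCopyWarning above appears because eth_monthly_df is still a slice of eth_df when the "Time" column is assigned. Below is a minimal alternative sketch (my addition, not part of the original pipeline) that parses the timestamp with pd.to_datetime and works on an explicit .copy(), which avoids both the warning and the repeated string splits; the eth_alt_* names are hypothetical.

# Alternative tidy-up that sidesteps the SettingWithCopyWarning.
eth_alt_df = eth_df.copy()
eth_alt_df["snapped_at"] = pd.to_datetime(eth_alt_df["snapped_at"], format="%Y-%m-%d %H:%M:%S UTC")
eth_alt_monthly_df = eth_alt_df[eth_alt_df["snapped_at"].dt.day == 1].copy()
eth_alt_monthly_df["Time"] = eth_alt_monthly_df["snapped_at"].dt.strftime("%Y-%m")
eth_alt_monthly_df = eth_alt_monthly_df.set_index("Time")[["price"]]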

Now this dataframe is tidy enough to do some visualization. To be fair, I never realized that the ups and downs in the price of ETH were this large until I saw this graph.

In [29]:
eth_monthly_df.price.plot.line(x='Month_index', y='price')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c3687f280>

Our next task is the dataframe for the Steam survey.

In [30]:
steam_survey_df.head()
Out[30]:
date category name change percentage
0 2008-11-01 AMD CPU Speeds 1.4 Ghz to 1.49 Ghz -0.0004 0.0036
1 2008-11-01 AMD CPU Speeds 1.5 Ghz to 1.69 Ghz -0.0025 0.0224
2 2008-11-01 AMD CPU Speeds 1.7 Ghz to 1.99 Ghz -0.0024 0.0714
3 2008-11-01 AMD CPU Speeds 2.0 Ghz to 2.29 Ghz -0.0040 0.1343
4 2008-11-01 AMD CPU Speeds 2.3 Ghz to 2.69 Ghz 0.0001 0.0727

Wow. There is far too much irrelevant data in this dataframe, as I doubt that "Free Hard Drive Space" is a great indicator of ETH prices. Let's start cleaning it up.

In [31]:
# Clean up the data from shs.csv: we only need GPU data for a certain time period.
# But first, some necessary prep work to make assigning the index easier later.
steam_survey_df["Year"] = steam_survey_df.date.str.split("-", expand=True)[0]
steam_survey_df["Month"] = steam_survey_df.date.str.split("-", expand=True)[1]
steam_survey_df["Year"] = pd.to_numeric(steam_survey_df["Year"], errors="coerce")
steam_survey_df["Month"] = pd.to_numeric(steam_survey_df["Month"], errors="coerce")
steam_survey1_df = steam_survey_df[((steam_survey_df.Year > 2017) | ((steam_survey_df.Year == 2017) & (steam_survey_df.Month > 9))) & ((steam_survey_df.Year != 2022) | (steam_survey_df.Month != 10))]
GPU_df = steam_survey1_df[steam_survey_df.category == "Video Card Description"]

GPU_df.name.value_counts()
GPU_df
<ipython-input-31-d9663a3efb16>:8: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  GPU_df = steam_survey1_df[steam_survey_df.category == "Video Card Description"]
Out[31]:
date category name change percentage Year Month
31764 2017-10-01 Video Card Description AMD Radeon HD 7700 Series -0.0029 0.0048 2017 10
31765 2017-10-01 Video Card Description AMD Radeon HD 7800 Series -0.0015 0.0042 2017 10
31766 2017-10-01 Video Card Description AMD Radeon HD 7900 Series -0.0027 0.0041 2017 10
31767 2017-10-01 Video Card Description AMD Radeon HD 8370D -0.0009 0.0024 2017 10
31768 2017-10-01 Video Card Description AMD Radeon HD 8470D -0.0021 0.0038 2017 10
... ... ... ... ... ... ... ...
51114 2022-09-01 Video Card Description NVIDIA GeForce RTX 3080 0.0000 0.0164 2022 9
51115 2022-09-01 Video Card Description NVIDIA GeForce RTX 3080 Laptop GPU 0.0000 0.0018 2022 9
51116 2022-09-01 Video Card Description NVIDIA GeForce RTX 3080 Ti 0.0000 0.0072 2022 9
51117 2022-09-01 Video Card Description NVIDIA GeForce RTX 3090 -0.0001 0.0047 2022 9
51118 2022-09-01 Video Card Description Other -0.0041 0.0963 2022 9

6017 rows × 7 columns
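Side note: the UserWarning above ("Boolean Series key will be reindexed") appears because the boolean mask is built from steam_survey_df while the frame being indexed is steam_survey1_df. Here is a sketch of the equivalent, warning-free selection (the result is the same in this case, since steam_survey1_df is a row subset of steam_survey_df):

# Build the mask from the same frame that is being indexed.
GPU_df = steam_survey1_df[steam_survey1_df.category == "Video Card Description"]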

It is worth noting that the original plan was to include data from September 2017 as well. However, as shown below, that month's data is largely missing.

In [32]:
# The original design called for five years of data. However, after inspecting the raw data I found
# that September 2017 is largely missing, so the analysis can only start at 2017-10.

steam_20179_survey_df = steam_survey_df[(steam_survey_df.Year == 2017) & (steam_survey_df.Month == 9)]
GPU_20179_df = steam_20179_survey_df[steam_survey_df.category == "Video Card Description"]
GPU_20179_df

# As you can see, there are only 4 rows of data
<ipython-input-32-26f12f71cddf>:5: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  GPU_20179_df = steam_20179_survey_df[steam_survey_df.category == "Video Card Description"]
Out[32]:
date category name change percentage Year Month
31553 2017-09-01 Video Card Description NVIDIA GeForce GTX 1060 0.0117 0.0747 2017 9
31554 2017-09-01 Video Card Description NVIDIA GeForce GTX 750 Ti 0.0081 0.0603 2017 9
31555 2017-09-01 Video Card Description NVIDIA GeForce GTX 960 0.0090 0.0659 2017 9
31556 2017-09-01 Video Card Description Other 0.6720 0.7991 2017 9

The next part is the most time-consuming part of the project. We know that a number of GPUs are not suitable for mining, since mining has a hard requirement on the GPU's VRAM. Therefore we have to do a manual selection to filter out the GPUs that are irrelevant here. For a closer look, please follow the next block of code.

In [33]:
# Since mining has a hard VRAM requirement, we have to label each card as mining-capable or not.

# First, split the cards by manufacturer (AMD vs. NVIDIA).
GPU_df["manu"] = GPU_df.name.str.split(" ", expand=True)[0]
GPU_df["mining"] = 0

# We know that all NVIDIA GPUs in the RTX line are capable of mining.

GPU_df["type1"] = GPU_df.name.str.split(" ", expand=True)[2]
GPU_df.loc[GPU_df['type1'] == "RTX",["mining"]] = 1

# Next, we know that GTX cards from the 10 series onward are all capable except the GTX 1050 and GTX 1050 Ti.

GPU_df["type2"] = GPU_df.name.str.split(" ", expand=True)[3]
# GPU_df.loc[(GPU_df['type1'] == "GTX" & (GPU_df['type2'] == "1080" | GPU_df['type2'] == "1070" | GPU_df['type2'] == "1060")) ,["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "GTX") & (GPU_df['type2'] == "1080"),["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "GTX") & (GPU_df['type2'] == "1070"),["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "GTX") & (GPU_df['type2'] == "1060"),["mining"]] = 1

# Next, let's work on the AMD cards. Most cards starting with RX are capable of mining, except a few low-end models.
GPU_df.loc[(GPU_df['type1'] == "RX"), ["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "460"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "560"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "450"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "550"),["mining"]] = 0
GPU_df.loc[(GPU_df['type1'] == "RX") & (GPU_df['type2'] == "Vega"),["mining"]] = 0

# The AMD R9 200 series is a problem: the 290 and 280 are capable of mining while the 250 and 260 are not, but
# the survey lumps them all together as "200 Series". Sadly, those rows cannot be counted in our analysis.

GPU_df.loc[(GPU_df['type1'] == "R9") & (GPU_df['type2'] == "390"),["mining"]] = 1
GPU_df.loc[(GPU_df['type1'] == "R9") & (GPU_df['type2'] == "380"),["mining"]] = 1

GPU_df
<ipython-input-33-e1544ff5d785>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  GPU_df["manu"] = GPU_df.name.str.split(" ", expand=True)[0]
<ipython-input-33-e1544ff5d785>:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  GPU_df["mining"] = 0
<ipython-input-33-e1544ff5d785>:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  GPU_df["type1"] = GPU_df.name.str.split(" ", expand=True)[2]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1817: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)
<ipython-input-33-e1544ff5d785>:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  GPU_df["type2"] = GPU_df.name.str.split(" ", expand=True)[3]
Out[33]:
date category name change percentage Year Month manu mining type1 type2
31764 2017-10-01 Video Card Description AMD Radeon HD 7700 Series -0.0029 0.0048 2017 10 AMD 0 HD 7700
31765 2017-10-01 Video Card Description AMD Radeon HD 7800 Series -0.0015 0.0042 2017 10 AMD 0 HD 7800
31766 2017-10-01 Video Card Description AMD Radeon HD 7900 Series -0.0027 0.0041 2017 10 AMD 0 HD 7900
31767 2017-10-01 Video Card Description AMD Radeon HD 8370D -0.0009 0.0024 2017 10 AMD 0 HD 8370D
31768 2017-10-01 Video Card Description AMD Radeon HD 8470D -0.0021 0.0038 2017 10 AMD 0 HD 8470D
... ... ... ... ... ... ... ... ... ... ... ...
51114 2022-09-01 Video Card Description NVIDIA GeForce RTX 3080 0.0000 0.0164 2022 9 NVIDIA 1 RTX 3080
51115 2022-09-01 Video Card Description NVIDIA GeForce RTX 3080 Laptop GPU 0.0000 0.0018 2022 9 NVIDIA 1 RTX 3080
51116 2022-09-01 Video Card Description NVIDIA GeForce RTX 3080 Ti 0.0000 0.0072 2022 9 NVIDIA 1 RTX 3080
51117 2022-09-01 Video Card Description NVIDIA GeForce RTX 3090 -0.0001 0.0047 2022 9 NVIDIA 1 RTX 3090
51118 2022-09-01 Video Card Description Other -0.0041 0.0963 2022 9 Other 0 None None

6017 rows × 11 columns

Now we have marked all the "minable" GPUs. Some columns were useful for that selection but are no longer needed, so let us drop them.

In [34]:
# Cleaning up the dataframe
gpu1_df = GPU_df.loc[(GPU_df['mining'] == 1)]
gpu1_df["Time"] = gpu1_df.date.str.split("-", expand=True)[0] + "-" + gpu1_df.date.str.split("-", expand=True)[1]
gpu1_df = gpu1_df[["Time","name","change","percentage"]]
gpu1_df
<ipython-input-34-7bbf11be2e0f>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gpu1_df["Time"] = gpu1_df.date.str.split("-", expand=True)[0] + "-" + gpu1_df.date.str.split("-", expand=True)[1]
Out[34]:
Time name change percentage
31777 2017-10 AMD Radeon R9 380 Series -0.0013 0.0023
31778 2017-10 AMD Radeon R9 390 Series -0.0011 0.0017
31780 2017-10 AMD Radeon RX 470 -0.0008 0.0016
31781 2017-10 AMD Radeon RX 480 -0.0024 0.0037
31819 2017-10 NVIDIA GeForce GTX 1060 0.0382 0.1129
... ... ... ... ...
51113 2022-09 NVIDIA GeForce RTX 3070 Ti Laptop GPU 0.0006 0.0024
51114 2022-09 NVIDIA GeForce RTX 3080 0.0000 0.0164
51115 2022-09 NVIDIA GeForce RTX 3080 Laptop GPU 0.0000 0.0018
51116 2022-09 NVIDIA GeForce RTX 3080 Ti 0.0000 0.0072
51117 2022-09 NVIDIA GeForce RTX 3090 -0.0001 0.0047

1344 rows × 4 columns

Nice! This is the piece of information we needed. The problem now, as one can tell from the dataframe, is that it is almost impossible to produce a meaningful graph by plotting every GPU separately, since there are so many of them. (Trust me, no one wants to see that messy graph!) Therefore, to keep things straightforward and understandable, the next block of code plots how the total percentage of minable GPUs changes over time.

In [35]:
# Sum the shares of all minable cards at each point in time and plot the result.

gpu3_df = gpu1_df.groupby("Time").sum()

axes = gpu3_df.plot.line(subplots=True)
type(axes)
Out[35]:
numpy.ndarray

To be fair, the first version of the project focused on the sum of the minable GPUs. Although that idea was later abandoned, there is still something we can get out of it, like the clean correlation table below, which shows that the "change" column is unreliable for prediction.

In [36]:
# Actual merge
gpu3_df = gpu3_df.merge(eth_monthly_df, on=["Time"])
gpu3_df = gpu3_df.reset_index()
gpu3_df = gpu3_df.set_index("Time")
gpu3_df = gpu3_df.reset_index()
gpu3_df = gpu3_df.reset_index()
gpu2_df = gpu3_df[["index","percentage","change","price"]]
gpu2_df.corr()
Out[36]:
index percentage change price
index 1.000000 0.971512 -0.016450 0.692169
percentage 0.971512 1.000000 0.052833 0.561530
change -0.016450 0.052833 1.000000 -0.021742
price 0.692169 0.561530 -0.021742 1.000000

We have to admit that both the graph and the table are clean and tidy. However, for the kNN model to make better predictions, we need to keep the information about each video card separate, so the dataframe is going to get a little messier...

In [37]:
gpu1_df = gpu1_df.groupby(["Time","name"])["percentage"].sum().unstack().fillna(0)
gpu1_df
Out[37]:
name AMD Radeon R9 380 Series AMD Radeon R9 390 Series AMD Radeon RX 470 AMD Radeon RX 480 AMD Radeon RX 5500 XT AMD Radeon RX 5500M AMD Radeon RX 5600 XT AMD Radeon RX 570 AMD Radeon RX 5700 AMD Radeon RX 5700 XT ... NVIDIA GeForce RTX 3060 Laptop GPU NVIDIA GeForce RTX 3060 Ti NVIDIA GeForce RTX 3070 NVIDIA GeForce RTX 3070 Laptop GPU NVIDIA GeForce RTX 3070 Ti NVIDIA GeForce RTX 3070 Ti Laptop GPU NVIDIA GeForce RTX 3080 NVIDIA GeForce RTX 3080 Laptop GPU NVIDIA GeForce RTX 3080 Ti NVIDIA GeForce RTX 3090
Time
2017-10 0.0023 0.0017 0.0016 0.0037 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2017-11 0.0018 0.0000 0.0000 0.0029 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2017-12 0.0019 0.0000 0.0015 0.0030 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-01 0.0017 0.0000 0.0000 0.0026 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-02 0.0019 0.0000 0.0000 0.0029 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-03 0.0025 0.0018 0.0018 0.0042 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-04 0.0034 0.0027 0.0025 0.0062 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-05 0.0035 0.0029 0.0026 0.0064 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-06 0.0035 0.0028 0.0025 0.0063 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-07 0.0031 0.0025 0.0023 0.0058 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-08 0.0035 0.0027 0.0025 0.0065 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-09 0.0033 0.0026 0.0025 0.0062 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-10 0.0031 0.0024 0.0025 0.0060 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-11 0.0033 0.0026 0.0026 0.0065 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2018-12 0.0033 0.0025 0.0025 0.0062 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-01 0.0031 0.0024 0.0026 0.0061 0.0000 0.0000 0.0000 0.0022 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-02 0.0030 0.0023 0.0027 0.0059 0.0000 0.0000 0.0000 0.0027 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-03 0.0030 0.0023 0.0027 0.0063 0.0000 0.0000 0.0000 0.0033 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-04 0.0028 0.0022 0.0031 0.0060 0.0000 0.0000 0.0000 0.0043 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-05 0.0028 0.0023 0.0031 0.0060 0.0000 0.0000 0.0000 0.0044 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-06 0.0028 0.0022 0.0039 0.0062 0.0000 0.0000 0.0000 0.0060 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-07 0.0027 0.0021 0.0035 0.0059 0.0000 0.0000 0.0000 0.0068 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-08 0.0027 0.0021 0.0031 0.0061 0.0000 0.0000 0.0000 0.0067 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-09 0.0026 0.0020 0.0032 0.0062 0.0000 0.0000 0.0000 0.0072 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-10 0.0024 0.0019 0.0033 0.0059 0.0000 0.0000 0.0000 0.0082 0.0000 0.0015 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-11 0.0025 0.0019 0.0031 0.0056 0.0000 0.0000 0.0000 0.0079 0.0000 0.0022 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2019-12 0.0023 0.0019 0.0028 0.0057 0.0000 0.0000 0.0000 0.0084 0.0000 0.0026 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-01 0.0025 0.0020 0.0032 0.0060 0.0000 0.0000 0.0000 0.0097 0.0015 0.0044 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-02 0.0023 0.0017 0.0033 0.0055 0.0000 0.0000 0.0000 0.0100 0.0016 0.0047 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-03 0.0022 0.0016 0.0033 0.0055 0.0000 0.0000 0.0000 0.0108 0.0019 0.0051 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-04 0.0022 0.0017 0.0034 0.0057 0.0000 0.0000 0.0000 0.0113 0.0020 0.0060 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-05 0.0021 0.0016 0.0034 0.0055 0.0000 0.0000 0.0000 0.0120 0.0021 0.0066 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-06 0.0020 0.0016 0.0035 0.0053 0.0000 0.0000 0.0000 0.0129 0.0023 0.0074 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-07 0.0019 0.0000 0.0037 0.0053 0.0000 0.0000 0.0000 0.0139 0.0024 0.0080 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-08 0.0018 0.0000 0.0033 0.0052 0.0000 0.0000 0.0000 0.0139 0.0024 0.0088 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-09 0.0018 0.0000 0.0033 0.0053 0.0000 0.0000 0.0018 0.0144 0.0026 0.0095 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-10 0.0019 0.0000 0.0034 0.0050 0.0016 0.0000 0.0019 0.0141 0.0024 0.0093 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2020-11 0.0018 0.0000 0.0032 0.0048 0.0019 0.0000 0.0022 0.0142 0.0024 0.0094 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0022 0.0000 0.0000 0.0000
2020-12 0.0000 0.0000 0.0027 0.0040 0.0020 0.0000 0.0022 0.0125 0.0020 0.0086 ... 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0045 0.0000 0.0000 0.0000
2021-01 0.0016 0.0000 0.0032 0.0046 0.0029 0.0000 0.0032 0.0148 0.0024 0.0103 ... 0.0000 0.0025 0.0000 0.0000 0.0000 0.0000 0.0063 0.0000 0.0000 0.0022
2021-02 0.0016 0.0000 0.0031 0.0045 0.0032 0.0000 0.0032 0.0145 0.0023 0.0102 ... 0.0000 0.0033 0.0000 0.0000 0.0000 0.0000 0.0074 0.0000 0.0000 0.0029
2021-03 0.0016 0.0000 0.0030 0.0044 0.0034 0.0000 0.0034 0.0149 0.0022 0.0101 ... 0.0000 0.0036 0.0124 0.0000 0.0000 0.0000 0.0081 0.0000 0.0000 0.0032
2021-04 0.0000 0.0000 0.0032 0.0043 0.0034 0.0000 0.0033 0.0150 0.0022 0.0097 ... 0.0000 0.0038 0.0133 0.0000 0.0000 0.0000 0.0082 0.0000 0.0000 0.0036
2021-05 0.0016 0.0000 0.0030 0.0042 0.0036 0.0000 0.0034 0.0148 0.0021 0.0096 ... 0.0020 0.0039 0.0142 0.0000 0.0000 0.0000 0.0086 0.0000 0.0000 0.0036
2021-06 0.0000 0.0000 0.0026 0.0037 0.0031 0.0000 0.0029 0.0131 0.0018 0.0080 ... 0.0051 0.0035 0.0148 0.0019 0.0000 0.0000 0.0081 0.0000 0.0000 0.0034
2021-07 0.0000 0.0000 0.0025 0.0037 0.0030 0.0000 0.0029 0.0131 0.0018 0.0080 ... 0.0065 0.0041 0.0153 0.0023 0.0000 0.0000 0.0085 0.0000 0.0000 0.0037
2021-08 0.0000 0.0000 0.0023 0.0034 0.0027 0.0000 0.0025 0.0117 0.0015 0.0070 ... 0.0094 0.0053 0.0155 0.0028 0.0023 0.0000 0.0081 0.0000 0.0015 0.0036
2021-09 0.0000 0.0000 0.0024 0.0035 0.0029 0.0000 0.0027 0.0123 0.0017 0.0078 ... 0.0130 0.0066 0.0164 0.0035 0.0027 0.0000 0.0096 0.0000 0.0025 0.0041
2021-10 0.0000 0.0000 0.0025 0.0035 0.0030 0.0000 0.0028 0.0125 0.0017 0.0079 ... 0.0136 0.0079 0.0165 0.0037 0.0030 0.0000 0.0100 0.0000 0.0030 0.0043
2021-11 0.0000 0.0000 0.0023 0.0036 0.0029 0.0000 0.0027 0.0123 0.0016 0.0077 ... 0.0175 0.0095 0.0174 0.0046 0.0035 0.0000 0.0108 0.0000 0.0037 0.0044
2021-12 0.0000 0.0000 0.0021 0.0038 0.0028 0.0016 0.0025 0.0115 0.0015 0.0073 ... 0.0168 0.0125 0.0187 0.0048 0.0040 0.0000 0.0107 0.0000 0.0038 0.0042
2022-01 0.0000 0.0000 0.0021 0.0037 0.0027 0.0018 0.0024 0.0113 0.0016 0.0072 ... 0.0199 0.0117 0.0175 0.0054 0.0041 0.0000 0.0107 0.0000 0.0041 0.0042
2022-02 0.0000 0.0000 0.0020 0.0036 0.0027 0.0018 0.0024 0.0110 0.0000 0.0070 ... 0.0227 0.0126 0.0180 0.0062 0.0049 0.0000 0.0111 0.0000 0.0044 0.0041
2022-03 0.0000 0.0000 0.0018 0.0035 0.0025 0.0017 0.0024 0.0107 0.0000 0.0071 ... 0.0215 0.0149 0.0204 0.0061 0.0062 0.0000 0.0123 0.0015 0.0049 0.0044
2022-04 0.0000 0.0000 0.0019 0.0035 0.0026 0.0018 0.0023 0.0110 0.0000 0.0072 ... 0.0257 0.0150 0.0187 0.0069 0.0073 0.0000 0.0123 0.0017 0.0054 0.0045
2022-05 0.0000 0.0000 0.0019 0.0036 0.0027 0.0017 0.0025 0.0115 0.0016 0.0078 ... 0.0259 0.0165 0.0207 0.0070 0.0087 0.0000 0.0147 0.0018 0.0065 0.0051
2022-06 0.0000 0.0000 0.0018 0.0033 0.0025 0.0017 0.0023 0.0103 0.0000 0.0068 ... 0.0289 0.0166 0.0195 0.0074 0.0088 0.0000 0.0140 0.0017 0.0061 0.0045
2022-07 0.0000 0.0000 0.0020 0.0036 0.0029 0.0000 0.0026 0.0117 0.0016 0.0079 ... 0.0325 0.0185 0.0223 0.0083 0.0104 0.0000 0.0164 0.0018 0.0075 0.0052
2022-08 0.0000 0.0000 0.0016 0.0032 0.0024 0.0016 0.0021 0.0100 0.0000 0.0067 ... 0.0327 0.0193 0.0224 0.0077 0.0105 0.0018 0.0164 0.0018 0.0072 0.0048
2022-09 0.0000 0.0000 0.0016 0.0029 0.0022 0.0015 0.0020 0.0092 0.0000 0.0063 ... 0.0386 0.0204 0.0226 0.0083 0.0107 0.0024 0.0164 0.0018 0.0072 0.0047

60 rows × 48 columns

This is what the entire dataframe looks like. In the next section we will merge it with the price dataframe, which lets us build a model from it.

In [38]:
# Merge data here with the data from the ETH price.
# Some preparation so the two dataframes share a similar format.
gpu1_df = gpu1_df.reset_index()
gpu1_df = gpu1_df.reset_index()

# Actual merge
gpu2_df = gpu1_df.merge(eth_monthly_df, on=["Time"])
gpu2_df.head()
Out[38]:
index Time AMD Radeon R9 380 Series AMD Radeon R9 390 Series AMD Radeon RX 470 AMD Radeon RX 480 AMD Radeon RX 5500 XT AMD Radeon RX 5500M AMD Radeon RX 5600 XT AMD Radeon RX 570 ... NVIDIA GeForce RTX 3060 Ti NVIDIA GeForce RTX 3070 NVIDIA GeForce RTX 3070 Laptop GPU NVIDIA GeForce RTX 3070 Ti NVIDIA GeForce RTX 3070 Ti Laptop GPU NVIDIA GeForce RTX 3080 NVIDIA GeForce RTX 3080 Laptop GPU NVIDIA GeForce RTX 3080 Ti NVIDIA GeForce RTX 3090 price
0 0 2017-10 0.0023 0.0017 0.0016 0.0037 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 299.674889
1 1 2017-11 0.0018 0.0000 0.0000 0.0029 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 292.330539
2 2 2017-12 0.0019 0.0000 0.0015 0.0030 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 469.863480
3 3 2018-01 0.0017 0.0000 0.0000 0.0026 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 776.080330
4 4 2018-02 0.0019 0.0000 0.0000 0.0029 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1106.138172

5 rows × 51 columns

The Model

The goal of this project is to build a predictive model that predicts the price of ETH from the video cards people are using. After that we can run some analysis and check whether the model works. If we reach a positive result, we can say that such a model can be built and that there is a relationship between the price and the video cards people use.

First, we should select the right value of k. To do this we run an error analysis and check where the lowest error appears.

In [39]:
# get the features (in dict format) and the labels
# (do not split into training and validation sets)
features = gpu1_df.columns.to_list()
X_dict = gpu2_df[features].to_dict(orient="records")
y = gpu2_df["price"]

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=10)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

# calculates estimate of test error based on 10-fold cross validation
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y, 
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse
    
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()

# According to this, we should select the 2-nearest-neighbor model.
Out[39]:
2     3.853555e+05
1     4.936272e+05
3     5.127928e+05
10    5.654536e+05
5     5.671676e+05
6     5.849864e+05
8     5.900740e+05
9     5.946717e+05
11    6.013919e+05
4     6.176275e+05
12    6.330194e+05
13    6.390203e+05
7     6.454238e+05
14    6.617963e+05
15    6.952350e+05
16    6.992996e+05
17    7.267150e+05
18    7.707532e+05
19    7.964468e+05
20    8.264556e+05
21    8.721644e+05
22    8.764426e+05
23    8.937190e+05
24    9.102366e+05
25    9.322279e+05
26    9.648371e+05
27    9.921911e+05
28    1.015427e+06
29    1.043338e+06
30    1.073274e+06
31    1.097883e+06
32    1.117335e+06
33    1.131684e+06
34    1.152017e+06
35    1.175076e+06
36    1.199736e+06
37    1.222803e+06
38    1.247678e+06
39    1.279414e+06
40    1.308594e+06
41    1.345659e+06
42    1.375562e+06
43    1.409627e+06
44    1.459220e+06
45    1.496547e+06
46    1.530694e+06
47    1.571892e+06
48    1.632691e+06
49    1.684954e+06
50    1.724338e+06
dtype: float64
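Note that cross_val_score with scoring="neg_mean_squared_error" returns (negated) MSE, so the values above are in squared dollars. A one-line sketch (my addition) to put the best score on the same scale as the RMSE reported later:

# Convert the best cross-validated MSE (k = 2) into an RMSE in dollars.
np.sqrt(test_errs.min())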

Based on the result above, we now use a 2-fold cross-validation to estimate the error more concretely for k = 2.

In [40]:
# next, we will build a 2-nearest neighbor model and check its cross-validation error

def get_val_error(X_train_dict, y_train, X_val_dict, y_val):
    
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)

    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    
    # Fit a 2-nearest neighbors model.
    model = KNeighborsRegressor(n_neighbors=2)
    model.fit(X_train_sc, y_train)
    
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    
    return rmse

def get_cross(a, b, c, d):
  return (get_val_error(a, b, c, d) + get_val_error(c, d, a, b))/2

train = gpu2_df.sample(frac=.5)
val = gpu2_df.drop(train.index)

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["price"]
y_val = val["price"]

get_cross(X_train_dict, y_train, X_val_dict, y_val)

# seems acceptable
Out[40]:
458.9752099894905

I would say this error is acceptable. It is also a better result than the model built from the sum of all percentages, whose error was somewhere around 580.
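For extra context (this check is my addition, not part of the original analysis), the RMSE above is easier to judge against a trivial baseline that always predicts the mean training price, reusing the y_train/y_val split from the previous cell:

# Baseline: always predict the mean of the training prices.
baseline_pred = np.full(len(y_val), y_train.mean())
baseline_rmse = np.sqrt(((y_val - baseline_pred) ** 2).mean())
baseline_rmse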

Finally, let's plot the predictions and see how the model performs visually.

In [41]:
# Lets see how our model works with a graph

features = gpu1_df.columns.to_list()
X_train_dict = gpu2_df[features].to_dict(orient="records")
y_train = gpu2_df["price"]

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)

scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# Fit a 2-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(X_train_sc, y_train)

# Calculate the model predictions on the training data.
y_train_pred = model.predict(X_train_sc)
y_train_pred

dict = {'real_price': y_train, 'predicted_price': y_train_pred} 
df_com = pd.DataFrame(dict)
df_com

axes = df_com.plot.line(subplots=True)
type(axes)
Out[41]:
numpy.ndarray

Given this graph, we believe we can say that we achieved our goal: the two curves are very similar, which suggests that our hypothesis holds and that there is a relationship between the features and the target here.
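If we want a number to go with the visual impression (again my addition, not part of the original analysis), the agreement between the two curves can be quantified from df_com. Keep in mind these are in-sample predictions, so the agreement is optimistic by construction.

# Correlation between the actual and in-sample predicted prices.
df_com["real_price"].corr(df_com["predicted_price"])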

Conclusion and some further insight

With the results from the model section, we successfully built a model and checked its accuracy. However, now that Ethereum has completed "the Merge" and can no longer be mined with video cards, we remain pessimistic about this model's accuracy going forward. We will call this "shattered dreams of wealth."

Thanks for reading!

In [42]:
%%shell
jupyter nbconvert --to html FinalProject.ipynb
[NbConvertApp] Converting notebook FinalProject.ipynb to html
[NbConvertApp] Writing 504374 bytes to FinalProject.html
Out[42]: