KNN Regression
As part of my senior project analyzing Colorado climate data provided by the NRCS, I decided to replace some missing values with estimates from the KNN regression available in scikit-learn.
In my dataset, I have 15-day high-temperature averages for 114 stations. Many of the stations have readings that are probably not correct, as the boxplot below shows.
In the boxplot, the black and red dots mark the suspected outliers, whose values we will re-predict with KNN regression.
With KNN regression we can take the middle 98% or so of the data (which we believe to contain much more accurate readings), use it to find rows with attributes similar to those of each corrupted row, and estimate what the corrupted row's actual value should have been.
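A minimal sketch of that idea follows. It is not the project's actual script: the data is synthetic, the variable names are made up for illustration, and the choices of 5 neighbors and feature standardization are assumptions rather than the settings I used.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, NOT the NRCS readings).
n = 500
elevation = rng.uniform(1500, 3500, n)   # station elevation, arbitrary units
month = rng.integers(1, 13, n)           # 1..12
temp = (30 - 0.006 * (elevation - 1500)
        + 10 * np.cos((month - 7) * np.pi / 6)
        + rng.normal(0, 1.5, n))

# Corrupt a handful of readings to mimic faulty observations.
bad_idx = rng.choice(n, size=10, replace=False)
temp[bad_idx] += rng.choice([-60.0, 60.0], size=10)

# Treat the middle 98% of readings as trustworthy.
lo, hi = np.percentile(temp, [1, 99])
good = (temp >= lo) & (temp <= hi)

X = np.column_stack([elevation, month])

# Standardize first so elevation does not dominate the distance metric
# (an assumption of this sketch), then average the 5 nearest good rows.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X[good], temp[good])

# Replace only the suspect readings with KNN estimates.
repaired = temp.copy()
repaired[~good] = model.predict(X[~good])
```

Because each estimate is an average of trusted readings, every repaired value lands inside the trusted range. Note that a percentile cutoff always flags about 2% of rows, so some of the flagged rows may be legitimate extremes.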
The following will be provided for reference: the R script used to format the data, the Python script used to run the regression, and the R script used for visualization and data wrangling. Keep in mind that I also estimated values to address other problems in the data.
The regression made its predictions from combinations of elevation and month; in other words, we made a prediction for every month at every elevation. Before deciding to do that, as the Python script shows, I validated the approach on the good data by randomizing it and splitting it into separate sets, and obtained a correlation of 0.85 between predictions and targets, which suggests the supervised model learned a useful relationship.
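The validation step can be sketched roughly as below. Again, this uses synthetic data and hypothetical names; the 75/25 split, the 5 neighbors, and the standardization are assumptions, and the correlation it prints will not match the 0.85 from the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic "good" data (hypothetical, NOT the NRCS readings).
n = 600
elevation = rng.uniform(1500, 3500, n)
month = rng.integers(1, 13, n)
temp = (30 - 0.006 * (elevation - 1500)
        + 10 * np.cos((month - 7) * np.pi / 6)
        + rng.normal(0, 1.5, n))
X = np.column_stack([elevation, month])

# Shuffle the rows and hold out 25% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, temp, test_size=0.25, shuffle=True, random_state=0
)

# Fit on the training rows only.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X_train, y_train)

# Compare predictions with the held-out targets.
pred = model.predict(X_test)
r = np.corrcoef(pred, y_test)[0, 1]
print(f"correlation between predictions and targets: {r:.2f}")
```

Correlation only measures how well predictions track the targets, not how far off they are, so an error metric such as mean absolute error is a useful companion check.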
The following boxplot shows re-predicted “bad” values in red and black, among the already good data.
This could also be done using multiple regression, among other methods.