K-Nearest Neighbors Regression
The first non-linear model that we tried was a k-Nearest Neighbors Regressor. This model requires an additional step: determining the most appropriate value of k to use for the regression. We graphed the testing accuracy for odd values of k from 1 to 19, and the graph appeared to level off around k = 7. We decided to try k = 5, 7, and 9 to see whether any of them produced better results than the multiple linear regressors.
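As a rough sketch of that sweep (the report does not name a library or data source, so scikit-learn and the OpenML copy of the abalone data set are both assumptions here, as is the feature-scaling step), the train and test R² can be compared for each odd k:

    import pandas as pd
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Assumed data source: the OpenML copy of the abalone data set.
    # Rings serve as the age proxy; the Sex column is one-hot encoded.
    data = fetch_openml("abalone", version=1, as_frame=True)
    X = pd.get_dummies(data.data)
    y = data.target.astype(float)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Sweep odd values of k from 1 to 19, comparing train and test R^2.
    for k in range(1, 20, 2):
        knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
        knn.fit(X_train, y_train)
        print(f"k={k:2d}  train R^2={knn.score(X_train, y_train):.3f}"
              f"  test R^2={knn.score(X_test, y_test):.3f}")

Printing the train-set R² alongside the test-set R² also makes the overfitting gap discussed below directly visible.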
For k = 5, we got an R² value of 51.442%. For k = 7, we got an R² value of 51.758%. For k = 9, we got an R² value of 51.844%. None of the k-Nearest Neighbors models performed better than the multiple linear regression models. Additionally, the k-Nearest Neighbors models appeared to overfit the data, since the R² values for the training data set were significantly higher than the R² values for the testing data set.
Support Vector Regression
Our next approach was to use a Support Vector Regressor. This regressor resulted in an R² value of 54.430%. This model performed better than any of the previous models and did not
overfit the data. We continued trying other models with the aim of producing a model that
performs even better.
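A minimal sketch of this step, reusing the train/test split from the kNN sketch above; scikit-learn is again an assumption, and the StandardScaler is our addition, since SVR is sensitive to feature scale:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    # Assumes X_train, X_test, y_train, y_test from the kNN sketch above.
    # Default RBF-kernel SVR; the scaling step is our assumption.
    svr = make_pipeline(StandardScaler(), SVR())
    svr.fit(X_train, y_train)
    print(f"train R^2={svr.score(X_train, y_train):.3f}")
    print(f"test R^2={svr.score(X_test, y_test):.3f}")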
Random Forest Regression
Our next approach was to try a Random Forest Regressor to predict abalone age. Our first Random Forest model used 1,000 estimators, each with a maximum depth of 7 splits. This performed better than the previous models, producing an R² value of 55.327%, but it also overfit the data.
We adjusted our second Random Forest model to have 500 estimators, each with a maximum depth
of 5 splits. This reduced the overfitting somewhat, but it also yielded a lower R² value of 52.263%.
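A sketch of both Random Forest configurations, under the same scikit-learn and train/test-split assumptions as the earlier sketches; the fixed random_state is our addition for reproducibility:

    from sklearn.ensemble import RandomForestRegressor

    # Assumes X_train, X_test, y_train, y_test from the kNN sketch above.
    # The two configurations described in this section; random_state is
    # our addition so the runs are reproducible.
    for n_trees, depth in [(1000, 7), (500, 5)]:
        rf = RandomForestRegressor(n_estimators=n_trees, max_depth=depth,
                                   random_state=0)
        rf.fit(X_train, y_train)
        print(f"{n_trees} trees, max_depth={depth}: "
              f"train R^2={rf.score(X_train, y_train):.3f}, "
              f"test R^2={rf.score(X_test, y_test):.3f}")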