K-Nearest Neighbors Regression
The first non-linear model that we tried was a k-Nearest Neighbors Regressor. This model requires an additional step: determining the most appropriate value of k to use for the regression. We graphed the testing accuracy for odd values of k from 1 to 19, and the graph appeared to level off around k = 7. We decided to try k = 5, 7, and 9 to see whether any of them produced better results than the multiple linear regressors.
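As a rough sketch of that sweep (the report does not name a library or data source, so scikit-learn and the OpenML copy of the abalone data set are both assumptions here, as is the feature-scaling step), the train and test R² can be compared for each odd k:

    import pandas as pd
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Assumed data source: the OpenML copy of the abalone data set.
    # Rings serve as the age proxy; the Sex column is one-hot encoded.
    data = fetch_openml("abalone", version=1, as_frame=True)
    X = pd.get_dummies(data.data)
    y = data.target.astype(float)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Sweep odd values of k from 1 to 19, comparing train and test R^2.
    for k in range(1, 20, 2):
        knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
        knn.fit(X_train, y_train)
        print(f"k={k:2d}  train R^2={knn.score(X_train, y_train):.3f}"
              f"  test R^2={knn.score(X_test, y_test):.3f}")

Printing the train-set R² alongside the test-set R² also makes the overfitting gap discussed below directly visible.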
For k = 5, we got an R² value of 51.442%. For k = 7, we got an R² value of 51.758%. For k = 9, we got an R² value of 51.844%. None of the k-Nearest Neighbors models performed better than the multiple linear regression models. Additionally, the k-Nearest Neighbors models appeared to overfit the data, since the R² values for the training data set were significantly higher than the R² values for the testing data set.
Support Vector Regression
Our next approach was to use a Support Vector Regressor. This regressor resulted in an R² value of 54.430%. This model performed better than any of the previous models and did not
overfit the data. We continued trying other models with the aim of producing a model that
performs even better.
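A minimal sketch of this step, reusing the train/test split from the kNN sketch above; scikit-learn is again an assumption, and the StandardScaler is our addition, since SVR is sensitive to feature scale:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    # Assumes X_train, X_test, y_train, y_test from the kNN sketch above.
    # Default RBF-kernel SVR; the scaling step is our assumption.
    svr = make_pipeline(StandardScaler(), SVR())
    svr.fit(X_train, y_train)
    print(f"train R^2={svr.score(X_train, y_train):.3f}")
    print(f"test R^2={svr.score(X_test, y_test):.3f}")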
Random Forest Regression
Our next approach was to try a Random Forest Regressor to predict abalone age. Our first Random Forest model used 1,000 estimators, each with a maximum depth of 7 splits. This performed better than the previous models, producing an R² value of 55.327%, but it also overfit the data.
We adjusted our second Random Forest model to have 500 estimators, each with a maximum depth
of 5 splits. This reduced the overfitting somewhat, but it also yielded a lower R² value of 52.263%.
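A sketch of both Random Forest configurations, under the same scikit-learn and train/test-split assumptions as the earlier sketches; the fixed random_state is our addition for reproducibility:

    from sklearn.ensemble import RandomForestRegressor

    # Assumes X_train, X_test, y_train, y_test from the kNN sketch above.
    # The two configurations described in this section; random_state is
    # our addition so the runs are reproducible.
    for n_trees, depth in [(1000, 7), (500, 5)]:
        rf = RandomForestRegressor(n_estimators=n_trees, max_depth=depth,
                                   random_state=0)
        rf.fit(X_train, y_train)
        print(f"{n_trees} trees, max_depth={depth}: "
              f"train R^2={rf.score(X_train, y_train):.3f}, "
              f"test R^2={rf.score(X_test, y_test):.3f}")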