Predicting Ages of Abalone

K-Nearest Neighbors Regression

The first non-linear model we tried was a k-Nearest Neighbors regressor. This model requires an additional step: choosing an appropriate value of k. We graphed the test-set score for odd values of k from 1 to 19, and the curve appeared to level off around k = 7, so we tried k = 5, 7, and 9 to see whether any of them produced better results than the multiple linear regression models.
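The k-selection sweep above can be sketched as follows. This is a minimal illustration using synthetic stand-in data (the actual abalone features and train/test split are not reproduced here), scoring each fit by test-set R2:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the abalone data.
X, y = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in range(1, 20, 2):  # odd k from 1 to 19
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)  # R^2 on the test set

best_k = max(scores, key=scores.get)
```

Plotting `scores` against k is what reveals where the curve levels off.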

The models produced R2 values of 51.442% for k = 5, 51.758% for k = 7, and 51.844% for k = 9. None of the k-Nearest Neighbors models performed better than the multiple linear regression models. They also appeared to overfit the data, since their R2 values on the training set were significantly higher than their R2 values on the testing set.
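The overfitting check described above amounts to comparing training and testing R2 for the same fit. A minimal sketch for a single kNN model, again on synthetic stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the abalone data.
X, y = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsRegressor(n_neighbors=7).fit(X_train, y_train)
train_r2 = knn.score(X_train, y_train)
test_r2 = knn.score(X_test, y_test)
gap = train_r2 - test_r2  # a large positive gap suggests overfitting
```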

Support Vector Regression

Our next approach was a Support Vector Regressor, which achieved an R2 value of 54.430%. This model performed better than any of the previous ones and did not overfit the data, but we continued trying other models with the aim of producing one that performs even better.
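A Support Vector Regression fit along these lines could look like the sketch below. The kernel and regularization settings shown are assumptions (the report does not state them), and the data is a synthetic stand-in; SVR is scale-sensitive, so features are standardized first:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the abalone data.
X, y = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical hyperparameters; the original settings are not given.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
svr.fit(X_train, y_train)

train_r2 = svr.score(X_train, y_train)
test_r2 = svr.score(X_test, y_test)
```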

Random Forest Regression

Our next approach was to try a Random Forest Regressor to predict abalone age. Our first Random Forest used 1,000 estimators, each with a maximum depth of 7 splits. It performed better than the previous models, producing an R2 value of 55.327%, but it also overfit the data.

For our second Random Forest model, we reduced the forest to 500 estimators, each with a maximum depth of 5 splits. This reduced the overfitting somewhat, but it also yielded a lower R2 value of 52.263%.
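The two Random Forest configurations above can be compared with a sketch like the following, again on synthetic stand-in data rather than the actual abalone features. Recording both training and testing R2 makes the overfitting trade-off between the two configurations visible:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the abalone data.
X, y = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for n_estimators, max_depth in [(1000, 7), (500, 5)]:
    rf = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=0,
        n_jobs=-1,
    ).fit(X_train, y_train)
    # Keep both scores so the train/test gap can be inspected.
    results[(n_estimators, max_depth)] = {
        "train_r2": rf.score(X_train, y_train),
        "test_r2": rf.score(X_test, y_test),
    }
```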