This section tests whether our wind categorization statistic can be predicted with a machine learning model. We chose a random forest because the response is categorical with more than two classes. We tuned three hyperparameters with cross-validation: the number of trees, the minimum number of samples required at a leaf node, and the number of features considered at each split. Because hyperparameter tuning is time-consuming on a dataset as large as ours, we ran this step in parallel. Future work could compare other machine learning models to improve accuracy.
library(dplyr)
library(tidymodels)
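The tuning step described at the start of this section can be sketched roughly as follows. This is a minimal illustration, not the code that produced our results: the `wind_demo` data frame and its column values are hypothetical stand-ins for the wind event data, the `ranger` engine and the `future` parallel backend are assumptions (the backend requires a recent version of tune), and the grid ranges are arbitrary.

```r
library(tidymodels)
library(future)  # assumption: future as the parallel backend

set.seed(123)

# Hypothetical stand-in data: a three-class outcome and three predictors.
wind_demo <- tibble(
  wind_category = factor(sample(c("low", "moderate", "high"), 300, replace = TRUE)),
  MAGNITUDE = rnorm(300, mean = 50, sd = 10),
  storm_time_duration = rexp(300),
  YEAR = sample(2000:2020, 300, replace = TRUE)
)

# 5-fold cross-validation, stratified on the outcome.
folds <- vfold_cv(wind_demo, v = 5, strata = wind_category)

# Random forest with the three tuned hyperparameters left as placeholders.
rf_spec <- rand_forest(trees = tune(), min_n = tune(), mtry = tune()) |>
  set_engine("ranger") |>  # assumption: ranger engine
  set_mode("classification")

rf_wf <- workflow() |>
  add_formula(wind_category ~ .) |>
  add_model(rf_spec)

# Small illustrative grid: 2 levels per hyperparameter, 8 combinations.
rf_grid <- grid_regular(
  trees(range = c(100, 300)),
  min_n(range = c(2, 10)),
  mtry(range = c(1, 3)),
  levels = 2
)

# Evaluate the grid across the folds using two parallel workers.
plan(multisession, workers = 2)
rf_tuned <- tune_grid(
  rf_wf,
  resamples = folds,
  grid = rf_grid,
  metrics = metric_set(accuracy, kap)
)
plan(sequential)

# Hyperparameter combination with the highest cross-validated accuracy.
best_params <- select_best(rf_tuned, metric = "accuracy")
best_params
```

The grid here is deliberately tiny so the sketch runs quickly. A real search over a large dataset would use wider ranges and more workers.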
Feature               Importance
STATE                 0.130906821
YEAR                  0.059272694
MONTH_NAME            0.029812890
EVENT_TYPE            0.069625612
MAGNITUDE             0.133540752
MAGNITUDE_TYPE        0.036821833
storm_time_duration   0.059528004
change_lon            0.014531206
change_lat            0.008056711
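Feature importances like those listed above can be pulled from a fitted random forest as a named numeric vector. The sketch below shows one way to do this. It assumes the `ranger` engine with permutation importance, which may not be the importance measure used for our results, and `wind_demo` is again hypothetical data in which the category is derived from `MAGNITUDE`.

```r
library(tidymodels)

set.seed(123)

# Hypothetical data: the outcome is derived from MAGNITUDE; the other two
# predictors are pure noise.
wind_demo <- tibble(
  MAGNITUDE = rnorm(300, mean = 50, sd = 10),
  storm_time_duration = rexp(300),
  YEAR = sample(2000:2020, 300, replace = TRUE)
) |>
  mutate(
    wind_category = cut(
      MAGNITUDE,
      breaks = c(-Inf, 45, 55, Inf),
      labels = c("low", "moderate", "high")
    )
  )

# Ask ranger to compute permutation importance while fitting (assumption).
rf_fit <- rand_forest(trees = 200) |>
  set_engine("ranger", importance = "permutation") |>
  set_mode("classification") |>
  fit(wind_category ~ ., data = wind_demo)

# The underlying ranger object stores importances as a named numeric vector.
importance <- extract_fit_engine(rf_fit)$variable.importance
sort(importance, decreasing = TRUE)
```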
Conclusion
In the end, our model with the best hyperparameters had an accuracy of 0.6898446 and a kappa value of 0.4593361. Kappa measures how far the model's agreement with the true categories exceeds the agreement expected by chance: 0 indicates chance-level performance and 1 indicates perfect agreement. The most important features were wind magnitude and the state in which the event took place. The least important were the changes in latitude and longitude, possibly because many wind events did not record them. Because kappa is well above 0, the model predicts our wind categorization statistic better than chance, which suggests that the statistic is predictable and that further modeling work is worthwhile.
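To make the two reported metrics concrete, the sketch below computes accuracy and Cohen's kappa by hand from a confusion matrix and compares them with the yardstick functions. The `truth` and `pred` vectors are made-up labels for illustration only, not our model's predictions. Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the row and column totals.

```r
library(yardstick)

lvls <- c("low", "moderate", "high")

# Made-up true and predicted categories for ten events.
truth <- factor(
  c("low", "low", "low", "moderate", "moderate",
    "moderate", "high", "high", "high", "high"),
  levels = lvls
)
pred <- factor(
  c("low", "low", "moderate", "moderate", "moderate",
    "high", "high", "high", "low", "high"),
  levels = lvls
)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm <- table(truth, pred)
n <- sum(cm)

# Observed agreement: the share of events on the diagonal (the accuracy).
p_o <- sum(diag(cm)) / n

# Chance agreement: product of the matching row and column totals, summed.
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa_manual <- (p_o - p_e) / (1 - p_e)

# The same quantities from yardstick.
accuracy_yardstick <- accuracy_vec(truth, pred)
kappa_yardstick <- kap_vec(truth, pred)

c(accuracy = p_o, kappa = kappa_manual)
```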