[PM4] A Test and Validation of Genetic Algorithms and Cross-Validation in Variable Selection and Model Building [Podium presentation]

2017 ISPOR 22nd Annual International Conference

Zur, R. | Sherman, S. | Aballéa, S.

OBJECTIVES: When developing statistical models to predict health care costs, resource utilization, or clinical outcomes, obtaining a reliable set of the predictors that most strongly affect the outcome can be challenging. The objective of this analysis was to test and validate a genetic algorithm (GA) for variable selection with integrated cross-validation (CV) in a predictive regression model. The proposed method was compared with forward selection (FS).

METHODS: A simulation study was performed to test and validate the integrated GA and CV (GA-CV) algorithm with repeated random selection of 50 test sets. To overcome variability from different random folds, 20 different random selections were performed. The optimal set of variables was identified based on the proportion of times each variable was included in the models that minimized the mean squared error of the predictions. In this exercise, the number of events was modeled from Poisson distributions, and the model included a treatment variable (yes vs. no), 3 integer covariates, and 3 continuous covariates. Each covariate was either 1) unrelated to the outcome, 2) moderately associated with the outcome, or 3) highly associated with the outcome.

RESULTS: The GA-CV algorithm selected the covariates associated with the outcome in 96% of the simulations and excluded the covariates unrelated to the outcome in 57% of the simulations, compared with 83% and 55% of the simulations, respectively, for the FS algorithm.

CONCLUSIONS: In a simulation study, the GA-CV algorithm successfully identified covariates associated with the outcome while avoiding covariates not associated with it. It performed better than FS at identifying impactful variables and equivalently to FS at excluding non-impactful variables. The integrated GA-CV algorithm should be considered when building models of count data, and its effectiveness should be studied for other types of outcomes.
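
The GA-CV idea described in the methods can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the effect sizes, sample size, GA settings (population size, generations, mutation rate), the 5-fold split, the 0.5 inclusion threshold, and all function names (`simulate`, `fit_poisson`, `cv_mse`, `ga_select`) are assumptions chosen for illustration. Only 5 repetitions are run here to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=300):
    """Poisson counts from a treatment flag, 3 integer and 3 continuous covariates."""
    treat = rng.integers(0, 2, n)
    ints = rng.integers(0, 4, (n, 3))
    cont = rng.normal(size=(n, 3))
    X = np.column_stack([treat, ints, cont]).astype(float)
    # Hypothetical effects: within each covariate type, one unrelated (0),
    # one moderately associated, and one highly associated with the outcome.
    beta = np.array([-0.4, 0.0, 0.1, 0.3, 0.0, 0.15, 0.5])
    y = rng.poisson(np.exp(0.3 + X @ beta))
    return X, y

def fit_poisson(X, y, iters=25):
    """Poisson log-linear regression with intercept, fitted by Newton-Raphson."""
    A = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(A.shape[1])
    b[0] = np.log(y.mean() + 1e-8)
    for _ in range(iters):
        mu = np.exp(np.clip(A @ b, -20, 20))
        # Tiny ridge term keeps the Hessian invertible in degenerate folds.
        H = A.T @ (A * mu[:, None]) + 1e-8 * np.eye(A.shape[1])
        b = b + np.linalg.solve(H, A.T @ (y - mu))
    return b

def cv_mse(X, y, mask, k=5, seed=1):
    """Mean squared prediction error over k random folds for the masked variables."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        b = fit_poisson(X[train][:, mask], y[train])
        pred = np.exp(b[0] + X[test][:, mask] @ b[1:])
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

def ga_select(X, y, seed=1, pop_size=20, gens=15, p_mut=0.1):
    """Small GA over inclusion bit masks; fitness is CV MSE (lower is better)."""
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5
    cache = {}

    def fitness(m):
        key = m.tobytes()
        if key not in cache:
            cache[key] = cv_mse(X, y, m, seed=seed)
        return cache[key]

    for _ in range(gens):
        pop = pop[np.argsort([fitness(m) for m in pop])]
        children = [pop[0].copy()]  # elitism: carry the best mask forward
        while len(children) < pop_size:
            a, b = pop[rng.integers(0, pop_size // 2, 2)]  # parents from better half
            child = np.where(rng.random(p) < 0.5, a, b)    # uniform crossover
            child ^= rng.random(p) < p_mut                 # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    return min(pop, key=fitness)

X, y = simulate()
# Repeat the search with different random fold assignments, then keep the
# variables that appear in at least half of the winning models (assumed threshold).
runs = np.array([ga_select(X, y, seed=s) for s in range(5)])
inclusion = runs.mean(axis=0)
selected = inclusion >= 0.5
print("inclusion proportion per variable:", inclusion)
print("selected variables:", np.flatnonzero(selected))
```

Caching fitness values by mask avoids refitting models the GA revisits, which matters because each fitness evaluation involves one Poisson fit per fold. A forward-selection baseline could reuse `cv_mse` in the same way, adding at each step the single variable that most reduces the cross-validated error.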