The complexity of dealing with missing data, variable selection, and cross validation for prediction models

Part of my research is developing prediction models using medical data collected from electronic health records (EHR). Three of the largest issues I have to deal with are:

Missing predictor values. This is incredibly common when including lab measurements in a prediction model, as not all labs are collected on every patient.
Variable selection. If we only cared about predictive performance of a model, we wouldn’t need variable selection, but in the medical setting we have to contend with the fact that prediction models often require doctors to input a set of observed values into some risk calculator. The fewer values that have to be entered (and also measured on a patient), the more likely doctors and patients will be to use the model.
Internal validation of the model. While the model should be validated on an external dataset, internal validation to get an initial estimate of model performance is important to get buy-in for the model. A simple way to do this is dividing your data into a training and test set, but as shown in Elements of Statistical Learning and also Regression Modeling Strategies, cross-validation (10-fold or even repeated CV) is a superior strategy for this.

To see how the complexity of these problems adds up, it is easiest to work backwards. First, 10-fold cross-validation requires fitting our model (using all of the variables) 10 times, each time on a training set containing 90% of our original data. Medical datasets tend to be small relative to what modern statistical software can handle quickly, but this can still take some time if we’re using complex models such as random forests. Furthermore, you may want to go beyond 10-fold cross-validation and use repeated 10-fold cross-validation, or even bootstrapping. These methods could require fitting the model hundreds of times!
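As a rough illustration of how quickly the fits pile up, here is a minimal sketch of repeated k-fold splitting using only the Python standard library. The function name `repeated_kfold_indices` and the dataset size of 500 rows are my own assumptions for the example, not part of any particular package.

```python
import random

def repeated_kfold_indices(n_rows, n_folds=10, n_repeats=1, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_rows))
        rng.shuffle(order)  # a fresh random partition for every repeat
        # Deal the shuffled rows into folds so fold sizes differ by at most 1.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for held_out in range(n_folds):
            test_idx = folds[held_out]
            train_idx = [row for j, fold in enumerate(folds)
                         if j != held_out for row in fold]
            yield train_idx, test_idx

# Hypothetical dataset of 500 patients, 10-fold CV repeated 10 times.
splits = list(repeated_kfold_indices(n_rows=500, n_folds=10, n_repeats=10))
n_fits = len(splits)                       # one model fit per split
train_fraction = len(splits[0][0]) / 500   # share of the data used for training
print(n_fits, train_fraction)
```

Each split corresponds to one model fit, so 10 repeats of 10-fold CV already means 100 fits before any variable selection or imputation is added.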

Conducting variable selection adds the next layer of complexity. There are different ways of doing this (LASSO, predictive feature selection, forward/backward selection, selecting the most “important” variables from a random forest model, etc.), and most likely each method will give a distinct list of variables. As far as I know, there is no single best method; most likely, you could design a simulation study showing that your preferred method performs the best! Furthermore, methods such as LASSO require their own cross-validation to select a tuning parameter. Combining this with cross-validation for internal validation means that within each model-fitting step you have to do cross-validation AGAIN, perform variable selection, and then potentially refit your model using the reduced variable set. Without even dealing with missing data, the number of models we fit is already very large, and that is before accounting for EDA and comparing different modeling approaches. Finally, the list of variables selected within each cross-validation fold can (and probably will) differ from the list selected using the full dataset. Thus, cross-validation gives you an estimate of model performance for the variable selection procedure, not for the specific model that you fit with just the reduced set of variables.
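To put a number on "very large", here is a back-of-the-envelope counter for the nested procedure. It is only a sketch: the function `count_model_fits` and its parameters are hypothetical, and it assumes a LASSO-style selector where one pass over the penalty path counts as a single fit per inner fold (set `fits_per_inner_fold` higher if every candidate penalty is fit separately).

```python
def count_model_fits(outer_folds=10, outer_repeats=1, inner_folds=10,
                     fits_per_inner_fold=1, refit_after_selection=True):
    """Count model fits for cross-validated variable selection (no imputation)."""
    tuning_fits = inner_folds * fits_per_inner_fold  # inner CV to pick the penalty
    selection_fit = 1                                # fit at the chosen penalty to get the variable list
    refit = 1 if refit_after_selection else 0        # refit on the reduced variable set
    fits_per_outer_split = tuning_fits + selection_fit + refit
    return outer_folds * outer_repeats * fits_per_outer_split

single = count_model_fits()                     # plain 10-fold outer CV
repeated = count_model_fits(outer_repeats=10)   # 10-fold outer CV repeated 10 times
print(single, repeated)
```

Under these assumptions a single round of 10-fold CV costs 120 fits, and repeating it 10 times costs 1,200, for one modeling approach and before any missing data handling.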

Finally, we arrive at the problem of missing predictor values. The recommended solution is multiple imputation, implemented through the Multivariate Imputation by Chained Equations (MICE) algorithm. Without going into detail, the algorithm goes through each variable with missing values in turn and uses the other predictor variables to build a model to impute the missing values. Once a variable has had its missing values imputed, it is used for imputing the other variables. Because the imputed values are random draws rather than known quantities, the whole procedure is repeated multiple (often 5-10) times, creating multiple datasets with different imputed values. Stef van Buuren has created a great resource on MICE, which also includes a description of different approaches for variable selection after performing MICE. A simple approach would be performing variable selection on each of the imputed datasets, and then keeping the variables that are selected in a majority of the datasets. Of course, a model must then be refit using only these variables, and I would imagine that accurate inference is difficult in this scenario. More complicated procedures using penalization have recently been developed by Du et al. So now we add to the earlier complexity by having to do everything 5-10 times WITHIN each cross-validation fitting step!
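The majority-vote approach to variable selection across imputed datasets is simple enough to sketch in a few lines. The helper `majority_vote_selection` and the variable names below are made up for illustration; the selections themselves would come from whatever selection method you run on each imputed dataset.

```python
from collections import Counter

def majority_vote_selection(selected_per_imputation):
    """Keep variables chosen in more than half of the imputed datasets."""
    n_imputations = len(selected_per_imputation)
    # set() guards against a variable being listed twice for one dataset.
    counts = Counter(var for selected in selected_per_imputation
                     for var in set(selected))
    return sorted(var for var, n in counts.items() if n > n_imputations / 2)

# Hypothetical selections from m = 5 imputed datasets.
selected = [
    ["age", "creatinine", "sodium"],
    ["age", "creatinine"],
    ["age", "sodium", "hemoglobin"],
    ["age", "creatinine", "hemoglobin"],
    ["age", "creatinine", "sodium"],
]
final_vars = majority_vote_selection(selected)
print(final_vars)  # ['age', 'creatinine', 'sodium']
```

Here hemoglobin is dropped because it was selected in only 2 of the 5 datasets. The final model would then be refit on `final_vars` in each imputed dataset.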

There are even more complications with multiple imputation. It is typically recommended to include the outcome as a predictor variable for imputation, but in the prediction modeling setting this would most likely lead to optimistic bias in model performance estimation. One way around this is to build an imputation model that includes the outcome using only the training data in each fold, and then use it to impute missing data for the test data in that fold. However, doing this for each cross-validation fold means that you’re fitting even more models! You could also use the StackImpute method, which does not use the outcome for imputation and instead uses a weighting method. Jaeger et al. also discuss the important question of whether you should impute the data during CV or before CV. Interestingly, their results show an important application of the bias-variance tradeoff, with the reduced variance from imputing BEFORE CV offsetting the increased bias, thus allowing us to at least cut down on some computational complexity.
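The "impute during CV" pattern boils down to learning the imputation model on the training rows of a fold and only applying it to the test rows. The sketch below shows that pattern with plain mean imputation as a deliberately simple stand-in for MICE. It does not use the outcome at all, and the helper names and toy data are assumptions for the example.

```python
from statistics import mean

def fit_mean_imputer(rows):
    """Learn one fill-in value per column from the given (training) rows only."""
    n_cols = len(rows[0])
    return [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]

def apply_imputer(rows, fill_values):
    """Replace missing entries (None) with the values learned on training data."""
    return [[fill_values[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

# Toy fold: two lab values per patient, None marks a missing measurement.
train = [[1.0, 10.0], [3.0, None], [None, 30.0], [5.0, 20.0]]
test = [[None, 40.0], [2.0, None]]

fill = fit_mean_imputer(train)         # learned from training rows only
train_imp = apply_imputer(train, fill)
test_imp = apply_imputer(test, fill)   # test rows never influence the imputer
print(fill, test_imp)
```

Imputing before CV would instead call `fit_mean_imputer` once on all rows, which is cheaper but lets the test rows inform the imputation.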

One final point about multiple imputation is that we also have to specify what we want to estimate. If we want to estimate model performance (say AUC), we should estimate the AUC within each of the multiply imputed datasets and then average these estimates. This of course becomes more complex when estimating calibration curves.
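For a scalar measure like AUC, "estimate within each imputed dataset, then average" can be sketched as follows. The predicted risks are invented numbers, and the `auc` helper is a plain pairwise implementation written for this example rather than a library call.

```python
from statistics import mean

def auc(y_true, scores):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1]  # observed outcomes for four patients
# Hypothetical predicted risks for the same patients in m = 3 imputed datasets.
preds_by_imputation = [
    [0.1, 0.4, 0.35, 0.8],
    [0.2, 0.3, 0.6, 0.7],
    [0.1, 0.6, 0.5, 0.9],
]
aucs = [auc(y, preds) for preds in preds_by_imputation]  # one AUC per imputed dataset
pooled_auc = mean(aucs)                                  # simple average across imputations
print(aucs, round(pooled_auc, 3))  # [0.75, 1.0, 0.75] 0.833
```

A calibration curve has no single number to average in this way, which is part of why that case is harder.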

This was meant to be a brief overview of the complexities and available options for dealing with these three important issues in prediction model development. I would love to hear of your solutions/workflow, and any other relevant literature on these topics!