Part of my research is developing prediction models using medical data collected from electronic health records (EHR). Three of the largest issues I have to deal with are:

Missing predictor values. This is incredibly common when including lab measurements in a prediction model, as not all labs are collected on every patient.

Variable selection. If we cared only about the predictive performance of a model, we wouldn't need variable selection, but in the medical setting we have to contend with the fact that prediction models often require doctors to input a set of observed values into some risk calculator. The fewer values that have to be entered (and also measured on a patient), the more likely doctors and patients are to use the model.

Internal validation of the model. While the model should ultimately be validated on an external dataset, internal validation to get an initial estimate of model performance is important for getting buy-in for the model. A simple way to do this is to split your data into a training and test set, but as shown in Elements of Statistical Learning and also Regression Modeling Strategies, cross-validation (10-fold, or even repeated CV) is a superior strategy.
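As a concrete illustration, here is a minimal sketch of 10-fold cross-validation in Python with scikit-learn, using a synthetic dataset as a stand-in for real EHR data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary-outcome data standing in for an EHR cohort.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 10 folds trains on 90% of the data and scores on the held-out 10%.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.3f} (sd {scores.std():.3f})")
```

The mean of the 10 held-out AUCs is the internal performance estimate; the spread across folds gives a rough sense of its stability.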
To see how the complexity of these problems adds up, it might be easier to work backwards. First, 10-fold cross-validation requires fitting our model (using all of the variables) on 10 different folds of the data, each containing 90% of our original data. Medical datasets tend not to be large relative to how quickly modern statistical software can fit models, but this can still take some time if we're using complex models such as random forests. Furthermore, you may want to go beyond 10-fold cross-validation and use repeated 10-fold cross-validation, or even bootstrapping. These methods can require fitting the model hundreds of times!
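To see how quickly the fit count grows, scikit-learn's RepeatedKFold makes the arithmetic explicit: 10 repeats of 10 folds is already 100 outer model fits, before any tuning or imputation is layered on.

```python
from sklearn.model_selection import RepeatedKFold

# Repeated 10-fold CV: 10 repeats x 10 folds = 100 train/test splits,
# and therefore 100 model fits for a single performance estimate.
rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
print(rkf.get_n_splits())  # → 100
```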
Conducting variable selection adds the next layer of complexity. There are many ways of doing this (LASSO, predictive feature selection, forward/backward selection, selecting the most "important" variables from a random forest model, etc.), and most likely each method will give a distinct list of variables. And, as far as I know, there is no single best method; most likely, you could construct a simulation study showing that your preferred method performs best! Furthermore, methods such as LASSO require some cross-validation to select a tuning parameter. Combining this with cross-validation, within each model-fitting step you have to do cross-validation AGAIN, perform variable selection, and then potentially refit your model using the reduced variable set. Without even dealing with missing data, you can see that the number of models we fit is already very large, and we haven't even considered this within the context of EDA and comparing different modeling approaches. The last thing I will say is that the list of variables selected within each cross-validation fold can be (and probably will be) different from the list selected using the full dataset. Thus, cross-validation gives you an estimate of model performance for the variable selection method, not for the specific model you fit with the reduced set of variables.
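One way to keep this honest, sketched here with scikit-learn and synthetic regression data, is to nest the selection step (here, a cross-validated LASSO) inside a pipeline so it is rerun within every outer fold; the outer CV then scores the selection procedure rather than any one fixed variable subset:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 30 candidate predictors, only 5 truly informative.
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV runs its own inner CV to pick the tuning parameter; SelectFromModel
# keeps the variables with nonzero coefficients. Because the whole pipeline is
# refit in each outer fold, the selected list can differ fold to fold.
pipe = Pipeline([
    ("select", SelectFromModel(LassoCV(cv=5, random_state=0))),
    ("model", LinearRegression()),
])
scores = cross_val_score(pipe, X, y, cv=10, scoring="r2")
print(f"Outer-CV R^2 of the selection procedure: {scores.mean():.3f}")
```

Note the fit arithmetic: 10 outer folds times 5 inner folds (times the LASSO's tuning-parameter grid) is where the hundreds of fits come from.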
Finally, we arrive at the problem of missing predictor values. The recommended solution is multiple imputation, implemented through the Multivariate Imputation by Chained Equations (MICE) algorithm. Without going into detail, the algorithm sequentially goes through each variable with missing values and uses the other predictor variables to build a model to impute them. Once a variable has had its missing values imputed, it is then used for imputing the other variables. Because the order of this imputation creates different prediction models, this is done multiple (often 5 to 10) times, creating multiple datasets with different imputed values. Stef van Buuren has created a great resource on MICE that also includes a description of different approaches for variable selection after performing MICE. A simple approach is to perform variable selection on each of the imputed datasets, and then keep the variables selected in a majority of the datasets. Of course, a model must then be refit using only these variables, and I would imagine that accurate inference is difficult in this scenario. More complicated procedures using penalization have recently been developed by Du et al. So now we layer on the complexity from before, by having to do everything 5 to 10 times WITHIN each cross-validation fitting step!
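For a rough sketch in Python: scikit-learn's experimental IterativeImputer is a MICE-inspired chained-equations implementation (not van Buuren's mice package itself). Running it several times with posterior sampling and different seeds yields multiple completed datasets, as assumed below on synthetic data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic predictors with ~20% of values missing at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan

# sample_posterior=True draws imputations from the posterior predictive
# distribution, so each seed gives a different completed dataset.
n_imputations = 5
imputed_datasets = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(n_imputations)
]
# Each completed dataset has no missing values, but the imputed draws differ.
```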
There are even more complications with multiple imputation: it is typically recommended to include the outcome as a predictor variable for imputation, but in the prediction modeling setting this would most likely lead to optimistic bias in the estimate of model performance. One way around this is to build an imputation model including the outcome on the training data in each fold, and then use it to impute missing data for that fold's test data. However, doing this for each cross-validation fold means that you're fitting even more models! You could also use the StackImpute method, which does not use the outcome for imputation and instead uses a weighting method. Jaeger et al. also discuss the important question of whether you should impute the data during CV or before CV. Interestingly, their results show an important application of the bias-variance tradeoff, with the reduced variance from imputing BEFORE CV offsetting the increased bias, thus allowing us to at least cut down on some computational complexity.
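A minimal sketch of the leakage-free "impute during CV" version, using a scikit-learn pipeline so the imputer is fit only on each fold's training data and then applied to that fold's test data (note that this simple version omits the outcome from the imputation model entirely, and uses a single imputation per fold rather than multiple):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic cohort with ~10% of predictor values missing at random.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.1] = np.nan

# Placing the imputer inside the pipeline means cross_val_score refits it
# on each fold's training data only, so no test-fold information leaks in.
pipe = Pipeline([
    ("impute", IterativeImputer(random_state=0)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")
print(f"Cross-validated AUC with within-fold imputation: {scores.mean():.3f}")
```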
One final point about multiple imputation is that we also have to specify what we want to estimate. If we want to estimate model performance (say, AUC), we should estimate the AUC within each of the multiply imputed datasets and then average those estimates. This becomes more complex when estimating calibration curves.
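For the AUC point estimate this pooling is straightforward (the Rubin's-rules point estimate is just the mean across imputations); a sketch with hypothetical predictions standing in for models refit on each imputed dataset:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical setup: one shared outcome vector, and one set of predicted
# risks per imputed dataset (predictions differ because imputed values do).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
preds_per_imputation = [y_true + rng.normal(scale=1.0, size=100)
                        for _ in range(5)]

# Estimate the AUC within each completed dataset, then average.
aucs = [roc_auc_score(y_true, p) for p in preds_per_imputation]
pooled_auc = float(np.mean(aucs))
print(f"Pooled AUC across imputations: {pooled_auc:.3f}")
```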
This was meant to be a brief overview of the complexities of, and available options for, dealing with these three important issues in prediction model development. I would love to hear about your solutions and workflows, and any other relevant literature on these topics!