Abilities of Statistical Models to Identify Subjects with Ghost Prognosis Factors

Nguyen JM, Gaultier A and A

Abstract

Background: Many tools are available to estimate prediction quality, but none assess the ability of a predictive model to identify completely missing or unknown prognostic factors, designated here as ghost factors (GFs). It may nevertheless be possible to predict whether a subject carries a GF.

Methods: To simulate the presence of a GF, a significant prognostic factor and all variables correlated with it were removed prior to model analysis. Public datasets and simulated data were used. A predictive statistical model was developed to assess the relationship between the presence of a GF and the predictive capacity of a given model, based on the correlation between the predicted outcome and GF presence. Five statistical models were compared using this procedure.

Results: After evaluating 6 real databases, the only statistical method consistently able to identify subjects with GFs was the use of optimized regression models. Using simulated, linearly correlated data, optimized regression models exhibited up to a 92% success rate, whereas conventional linear models had a success rate below 53%. Random forest and classification tree models had the highest success rates compared to the other evaluated models.

Conclusions: Model-based outcome prediction was assessed with respect to the presence of GFs. As GFs are unknown, only subjects who are carriers of significant unknown prognostic factors can be identified. As complex models outperformed linear models in identifying GF presence, we assume that the associations between GFs and outcome-predictive factors are also complex and not linear.
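The removal procedure described in the Methods can be illustrated with a small simulation. The sketch below is not the authors' implementation: the effect sizes, the 30% GF prevalence, the single observed covariate, and the function names `pearson` and `simulate_ghost_correlation` are all assumptions chosen for illustration. It also correlates the model's residuals, rather than its predictions, with GF presence, which is one plausible reading of how a hidden factor might leave a trace in a model fitted without it.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def simulate_ghost_correlation(n=2000, seed=0):
    """Fit a model without the ghost factor and correlate its residuals with it.

    All numeric settings here are illustrative assumptions, not values
    taken from the study.
    """
    rng = random.Random(seed)

    # One observed covariate and one binary ghost factor (GF), drawn
    # independently to mimic removing every variable correlated with the GF.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ghost = [1 if rng.random() < 0.3 else 0 for _ in range(n)]

    # The outcome depends on both the observed covariate and the hidden GF.
    y = [1.0 * xi + 2.0 * gi + rng.gauss(0.0, 1.0) for xi, gi in zip(x, ghost)]

    # Ordinary least squares of y on x only: the GF is withheld from the model.
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x

    # Residuals carry whatever the observed covariate could not explain.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # A clearly positive correlation suggests GF carriers are detectable.
    return pearson(residuals, ghost)


if __name__ == "__main__":
    print(f"residual-GF correlation: {simulate_ghost_correlation():.3f}")
```

In this toy setting, subjects with large positive residuals are the natural candidates for carrying the GF. Real data would call for the more flexible models compared in the study, since the abstract reports that conventional linear models performed poorly.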

Relevant Publications in Journal of Health Education Research & Development