Prognostic proteomic models for low event rates: A case study with myocardial infarction


Large-scale clinical proteomics provides increasing opportunities for patient risk stratification, especially with multi-marker models derived using machine learning techniques. Prognostic models can be developed as binary risk classifiers or from time-to-event data. Survival modeling is ubiquitous in the statistical literature, but support for machine learning optimization is more limited than for other regression techniques. We developed and assessed a novel prognostic model development method that combines two statistical techniques, survival analysis and subsampling, using existing machine learning tools in R. These methods were applied to a clinical dataset to identify a highly predictive proteomic model for myocardial infarction (MI) despite a low observed event rate.


Tools for Cox elastic net with subsampling were developed in R. Simulations were used to demonstrate the utility and accuracy of subsampling in a survival data context, with comparisons made to logistic regression, Cox elastic net (Coxnet), and SVM models. Following validation of the approach via simulations, models were developed and assessed on the HUNT3 dataset (n = 756), which had 61 (8.1%) MI events within four years of blood draw. Proteomic measurements were performed using SomaScan® v4.0.
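The abstract does not spell out the subsampling scheme, so the following is only a minimal sketch under one common assumption: each round keeps every event and draws an equal-sized random set of non-events, and a survival model would then be fit to each subsample. It is written in Python rather than R, the function name `balanced_subsamples` and its parameters (`n_rounds`, `ratio`, `seed`) are hypothetical, and the toy cohort only mirrors the reported HUNT3 counts (756 participants, 61 events). The model-fitting step is deliberately left out.

```python
import random


def balanced_subsamples(events, n_rounds=5, ratio=1.0, seed=0):
    """Yield sorted row indices that keep all events plus a random draw of non-events.

    events: sequence of 0/1 event indicators, one per participant.
    ratio:  non-events drawn per event (assumed 1:1 here, not taken from the source).
    """
    rng = random.Random(seed)
    event_idx = [i for i, e in enumerate(events) if e]
    nonevent_idx = [i for i, e in enumerate(events) if not e]
    k = min(len(nonevent_idx), round(ratio * len(event_idx)))
    for _ in range(n_rounds):
        # Sampling without replacement within a round; rounds are independent.
        yield sorted(event_idx + rng.sample(nonevent_idx, k))


# Toy cohort with the reported counts: 61 events among 756 participants.
events = [1] * 61 + [0] * (756 - 61)
subsamples = list(balanced_subsamples(events, n_rounds=3))
event_rates = [sum(events[i] for i in s) / len(s) for s in subsamples]
```

In a time-to-event setting each selected index would carry its follow-up time and event status into the fit, so the subsample changes the event fraction the model sees without discarding any observed events.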


Simulation results and analysis of the HUNT3 dataset show improved performance metrics with the subsampled survival method.


Survival analysis with subsampling is a novel combination of techniques that can be applied to proteomic data to improve biomarker discovery and predictive modeling when incidence rates are relatively low. Using these methods, sensitivity and specificity were more balanced on real-world hold-out test sets, and simulation results showed improved discrimination metrics with subsampled survival analysis. Additionally, simulations showed that the proteins most highly correlated with MI were selected for the final models, indicating that this method is a promising tool for clinical discovery and prognostic/diagnostic development.
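To make the point about balance concrete, the sketch below computes sensitivity and specificity from binary labels and applies them to a hypothetical classifier that never predicts an event. The cohort counts (61 events, 695 non-events) match those reported for HUNT3, but the predictions are invented for illustration and are not study results; the function name `sensitivity_specificity` is likewise an assumption.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Cohort with the reported counts; predictions are hypothetical.
y_true = [1] * 61 + [0] * 695
always_negative = [0] * len(y_true)

# Accuracy looks high (695 of 756 correct) even though no event is detected.
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, always_negative)
```

Here specificity is perfect while sensitivity is zero, which is the kind of imbalance that a low event rate can encourage and that reporting both metrics exposes.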


Y. Hagar
L.E. Alexander
J. Chadwick
G. Datta
M.A. Hinterberg

SomaLogic Operating Co., Inc., Boulder, CO USA
