Data-Driven Logistic Regression Ensembles

Advances in data-collection technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Effective statistical methods should excel in both prediction accuracy and biomarker identification. We introduce a novel approach to high-dimensional binary classification that integrates regularization with ensembling techniques. The method constructs compact ensembles of interpretable models derived by optimizing a global objective function. In medical genomics applications, the proposed approach identifies critical biomarkers overlooked by competing methods. We develop a variable importance ranking system to help researchers prioritize promising genes. We establish the method's asymptotic properties and provide an efficient computational algorithm. Through extensive simulations across complex scenarios and analysis of cancer genomics datasets, we demonstrate strong predictive performance. Based on the numerical experiments, we offer practical guidelines for determining the optimal ensemble size.