On feature selection in double-imbalanced data settings: a Random Forest approach
Feature selection is a critical step in high-dimensional classification tasks, particularly under the challenging conditions of double imbalance, that is, settings characterized by both class imbalance in the response variable and dimensional asymmetry in the data $(n \gg p)$. In such scenarios, traditional feature selection methods applied to Random Forests (RF) often yield unstable or misleading importance rankings. This paper proposes a novel thresholding scheme for feature selection based on minimal depth, which exploits the tree topology to assess variable relevance. Extensive experiments on simulated and real-world datasets demonstrate that the proposed approach produces more parsimonious and more accurate subsets of variables than conventional minimal depth-based selection. The method provides a practical and interpretable solution for variable selection in RF under double-imbalance conditions.
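To make the notion of minimal depth concrete, the following is a minimal sketch of how it could be computed from a fitted forest; it is not the thresholding scheme proposed in the paper, which the abstract does not specify. It assumes scikit-learn's `RandomForestClassifier`, a synthetic imbalanced dataset from `make_classification`, and two illustrative conventions that are assumptions of this sketch: a feature that never splits in a tree is assigned that tree's maximum depth, and the selection cutoff is simply the mean of the forest-averaged minimal depths. The helper names `tree_minimal_depth` and `forest_minimal_depth` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def tree_minimal_depth(tree, n_features):
    """Depth of the shallowest split on each feature in one fitted tree (root = 0)."""
    min_depth = np.full(n_features, np.nan)
    stack = [(0, 0)]  # (node id, depth), starting at the root
    while stack:
        node, depth = stack.pop()
        left, right = tree.children_left[node], tree.children_right[node]
        if left == right:  # leaf node: both children are -1
            continue
        feat = tree.feature[node]
        if np.isnan(min_depth[feat]) or depth < min_depth[feat]:
            min_depth[feat] = depth
        stack.append((left, depth + 1))
        stack.append((right, depth + 1))
    # Assumption of this sketch: unused features get the tree's maximum depth.
    return np.where(np.isnan(min_depth), tree.max_depth, min_depth)


def forest_minimal_depth(forest, n_features):
    """Average the per-tree minimal depths over all trees in the forest."""
    per_tree = [tree_minimal_depth(est.tree_, n_features) for est in forest.estimators_]
    return np.mean(per_tree, axis=0)


# Synthetic data with class imbalance (about 90% / 10%) and n much larger than p.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], shuffle=False, random_state=0,
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

md = forest_minimal_depth(forest, X.shape[1])
threshold = md.mean()  # illustrative cutoff only, not the paper's threshold
selected = np.flatnonzero(md <= threshold)
print("average minimal depth per feature:", np.round(md, 2))
print("selected feature indices:", selected)
```

Features that tend to split close to the root receive a small average minimal depth and are retained; the choice of cutoff is exactly the part that a thresholding scheme such as the one the paper proposes would replace.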
Fabio Demaria
Computing technology, computer technology
Fabio Demaria. On feature selection in double-imbalanced data settings: a Random Forest approach [EB/OL]. (2025-06-12) [2025-07-16]. https://arxiv.org/abs/2506.10929.