Accounting for multiplicity in machine learning benchmark performance
State-of-the-art (SOTA) performance refers to the highest performance achieved by some model on a test sample, preferably under controlled conditions such as public data (reproducibility) or public challenges (independent sample). Thousands of classifiers are applied, and the highest performance becomes the new reference point for a particular problem. In effect, this setup estimates the expected best performance among all classifiers applied to a random sample; a sample maximum estimate. In this paper, we argue that SOTA should instead be estimated by the expected performance of the best classifier, which can be done without knowing which classifier it is. Our contribution is the formal distinction between the two, and an investigation into the practical consequences of using the former to estimate the latter. This is done by presenting sample maximum estimator distributions for non-identical and dependent classifiers. We illustrate the impact on real-world examples from public challenges.
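The distinction between the two quantities can be illustrated with a small Monte Carlo sketch. All numbers below (number of classifiers, test-sample size, accuracy range) are hypothetical and not taken from the paper; the point is only that the sample maximum over many noisy evaluations systematically exceeds the expected performance of the single best classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: m classifiers with true accuracies between 0.88
# and 0.90, each evaluated once on a test sample of n items, so the
# observed accuracy carries binomial noise.
m, n = 1000, 2000
true_acc = rng.uniform(0.88, 0.90, size=m)

# Expected performance of the best classifier (the quantity the paper
# argues SOTA should estimate).
best_true = true_acc.max()

# Observed accuracy of each classifier on the shared test sample.
observed = rng.binomial(n, true_acc) / n

# SOTA as usually reported: the sample maximum over all classifiers.
sota = observed.max()

print(f"best classifier's true accuracy: {best_true:.4f}")
print(f"reported SOTA (sample maximum):  {sota:.4f}")
```

With many classifiers clustered near the top, the sample maximum almost surely lands above the best classifier's true accuracy, which is the multiplicity effect the abstract describes.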
Kajsa Møllersen, Einar Holsbø
Subject: Computing technology; computer technology
Kajsa Møllersen, Einar Holsbø. Accounting for multiplicity in machine learning benchmark performance [EB/OL]. (2025-07-14) [2025-07-25]. https://arxiv.org/abs/2303.07272.