A mixed sampling method for imbalanced datasets based on feature correlation analysis
Balancing data through sampling is one of the main approaches to the data imbalance problem. Building on the idea of mixed sampling, this paper proposes a mixed sampling method for imbalanced datasets based on feature correlation analysis. First, a ratio coefficient is set that fixes the share of important features among all features; the correlation coefficient between each feature attribute and the sample class is computed, and the features with the largest correlation coefficients are selected, according to the ratio coefficient, as the important feature set. Then the median of the per-class sample counts is taken as the dividing point, each class is given a sampling marker, and, in combination with the selected important feature subset, the over-sampling and under-sampling strategies are determined separately. Finally, a sampling balance coefficient is set, and the post-sampling size of each class is determined from the difference between its pre-sampling size and the median sample count. Comparative experiments are run on two groups of KEEL public datasets with three commonly used classification algorithms whose strategies differ markedly, comparing the unsampled data, the data after conventional mixed sampling, and the data after the proposed mixed sampling. The results show that, under all three classifiers, classification accuracy on the datasets processed by the proposed method is substantially higher than in the other two settings, which supports the effectiveness and reliability of the proposed method.
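The selection and sizing steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption: the function names `select_important_features` and `sampling_plan` are hypothetical, the correlation is taken to be the absolute Pearson coefficient against numerically encoded class labels, and the post-sampling size is assumed to be `n + balance * (median - n)`. How the important feature subset shapes the actual over-sampling and under-sampling strategies is not specified in the abstract and is not reproduced.

```python
import math
from statistics import median

import numpy as np


def select_important_features(X, y, ratio):
    """Pick the top `ratio` share of features by |Pearson r| with the label.

    Assumption: labels are numerically encoded; the paper's exact
    correlation measure is not stated in the abstract.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = max(1, math.ceil(ratio * X.shape[1]))
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    scores = np.nan_to_num(scores)  # constant features yield NaN; rank them last
    order = np.argsort(-scores, kind="stable")
    return sorted(order[:k].tolist())


def sampling_plan(y, balance):
    """Mark each class and compute an assumed post-sampling size.

    Classes larger than the median class size are marked "under",
    smaller ones "over". The target formula is an illustrative guess:
    balance=0 leaves sizes unchanged, balance=1 moves every class to
    the median.
    """
    labels, counts = np.unique(np.asarray(y), return_counts=True)
    med = median(counts.tolist())
    plan = {}
    for label, n in zip(labels.tolist(), counts.tolist()):
        if n > med:
            marker = "under"
        elif n < med:
            marker = "over"
        else:
            marker = "keep"
        plan[label] = (marker, int(round(n + balance * (med - n))))
    return plan
```

Under these assumptions, a dataset with class sizes 7, 3, and 1 has median 3, so with `balance=0.5` the largest class would be under-sampled toward 5 samples and the smallest over-sampled toward 2.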
Diao Xinping (刁新平), He Yang (何杨), Gao Xin (高欣)
Computing technology; computer technology
computer application technology; imbalanced datasets; correlation analysis; important feature subset; mixed sampling; sample quantity balance
Diao Xinping, He Yang, Gao Xin. A mixed sampling method for imbalanced datasets based on feature correlation analysis [EB/OL]. (2019-03-13) [2025-08-02]. http://www.paper.edu.cn/releasepaper/content/201903-140.