A Naive Bayes Classification Algorithm Based on Improved Feature Weighting
The traditional naive Bayes classification algorithm does not distinguish between feature terms by importance, which makes its classification results inaccurate. To address this problem, this paper introduces the Jensen-Shannon (JS) divergence and uses it to express the amount of information that each feature term provides. To compensate for the shortcomings of JS divergence, the paper adjusts and corrects it from three aspects: within-class and between-class term frequency, document frequency, and inverse category frequency corrected by the coefficient of variation. Finally, a weight is computed for each feature term and substituted into the naive Bayes formula. Comparative experiments against other algorithms show that the naive Bayes algorithm based on JS divergence and improved at the term, document, and category levels achieves the best classification results. The naive Bayes classification algorithm with JS-divergence feature weighting therefore offers substantially better classification performance than the other classification algorithms compared.
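The general idea of substituting a JS-divergence weight into the naive Bayes formula can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not give the exact formulas, so the weight of a term is assumed here to be the JS divergence between the class distribution of documents containing that term and the class prior, and the corrections for term frequency, document frequency, and inverse category frequency are omitted. The function names `js_divergence`, `train`, and `predict` are hypothetical.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def train(docs, labels):
    """Fit a multinomial naive Bayes model with one weight per term."""
    classes = sorted(set(labels))
    vocab = sorted({t for d in docs for t in d})
    class_docs = Counter(labels)
    prior = [class_docs[c] / len(docs) for c in classes]

    tf = {c: Counter() for c in classes}  # term counts per class
    df = {t: Counter() for t in vocab}    # documents containing t, per class
    for d, y in zip(docs, labels):
        tf[y].update(d)
        for t in set(d):
            df[t][y] += 1

    # log P(t | c) with Laplace smoothing
    log_like = {}
    for c in classes:
        total = sum(tf[c].values()) + len(vocab)
        log_like[c] = {t: math.log((tf[c][t] + 1) / total) for t in vocab}

    # Assumed weight: JS(P(C | t) || P(C)), with no further corrections.
    weights = {}
    for t in vocab:
        n_t = sum(df[t].values())
        posterior = [df[t][c] / n_t for c in classes]
        weights[t] = js_divergence(posterior, prior)
    return classes, prior, log_like, weights


def predict(model, doc):
    """Return the class maximizing log P(c) + sum_t w_t * log P(t | c)."""
    classes, prior, log_like, weights = model
    scores = {}
    for c, p in zip(classes, prior):
        score = math.log(p)
        for t in doc:
            if t in weights:  # ignore terms unseen in training
                score += weights[t] * log_like[c][t]
        scores[c] = score
    return max(scores, key=scores.get)


docs = [
    ["ball", "goal", "team"],
    ["goal", "match", "team"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]
model = train(docs, labels)
```

In this toy corpus, a term that occurs in only one class (such as "goal") receives a larger weight than a term spread across both classes (such as "team"), so it contributes more to the class score in `predict`.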
Wang Xueming, Ding Yue
Computing technology, computer technology
text classification; naive Bayes; JS divergence; term frequency; document frequency; category frequency
Wang Xueming, Ding Yue. A Naive Bayes Classification Algorithm Based on Improved Feature Weighting [EB/OL]. (2018-10-11) [2025-08-03]. https://chinaxiv.org/abs/201810.00074.