Optimization of the TF-IDF Algorithm Combined with an Improved CHI Statistical Method
Feature selection and feature weight calculation are two crucial steps in text classification and play a decisive role in its results. The traditional CHI statistical method has two shortcomings: a feature item's frequency of occurrence may be negatively correlated with a category, and the method does not account for the probability that a feature item occurs in a given text. To address these, the traditional CHI statistical method is improved by introducing factors such as a negative correlation judgment and frequency, and the TF-IDF algorithm is optimized by combining it with a semantic similarity calculation. A K-nearest neighbor (KNN) classifier and a support vector machine (SVM) classifier are each used in WEKA to classify a Weibo sentiment corpus. The experimental results show that the new method clearly improves the accuracy of text classification.
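For orientation, the two baseline quantities named in the abstract can be sketched as follows. This is a minimal illustration of the classic CHI statistic and a plain TF-IDF weight, not the improved method of the paper, whose exact formulas are not given here. The function names, the 2x2 count names (a, b, c, d), and the use of the sign of a*d - b*c as a negative correlation check are assumptions made for illustration.

```python
import math


def chi_square(a, b, c, d):
    """Classic CHI statistic for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def is_negatively_correlated(a, b, c, d):
    """Assumed check: the term is rarer inside the class than outside it."""
    return a * d - b * c < 0


def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """Plain TF-IDF weight (one common variant, natural logarithm).

    Assumes doc_length > 0 and docs_with_term > 0.
    """
    tf = term_count / doc_length
    idf = math.log(n_docs / docs_with_term)
    return tf * idf


# A term concentrated in the class and a term concentrated outside it
# receive the same classic CHI score; only the sign check separates them.
positive = chi_square(40, 10, 10, 40)
negative = chi_square(10, 40, 40, 10)
print(positive, negative, is_negatively_correlated(10, 40, 40, 10))
```

Because the classic statistic squares a*d - b*c, it cannot tell these two cases apart, which is the kind of situation a negative correlation judgment is meant to handle.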
马莹、赵辉、李万龙、崔岩、庞海龙
Computing technology, computer technology
Text classification; CHI statistics; TF-IDF algorithm; feature selection
马莹, 赵辉, 李万龙, 崔岩, 庞海龙. Optimization of the TF-IDF Algorithm Combined with an Improved CHI Statistical Method [EB/OL]. (2018-05-24)[2025-08-02]. https://chinaxiv.org/abs/201805.00488.