National Preprint Platform (国家预印本平台)

An Improved KNN Text Categorization Algorithm Based on Independent Category Features

Abstract

With the rapid growth of the World Wide Web, text categorization has become a key data mining technology for organizing and processing large document collections. The K-Nearest Neighbor (KNN) algorithm is a simple, effective, nonparametric classification method with good stability and high accuracy, and it is widely used across industries. In information security in particular, KNN has become one of the important algorithms for filtering text content. Class imbalance (class skew) is a common problem in data mining. When the training set is skewed, a KNN text classifier tends to mislabel instances of rare categories as common ones, which lowers macro-F1. Many solutions to this problem have been proposed. This paper presents an improved KNN text categorization algorithm based on independent category features, which uses a different value of K for each candidate category when computing the classification confidence. Experimental results show that the method performs well.
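
The abstract gives only the core idea: a separate K per candidate category when computing confidence. The following is a minimal sketch of one possible reading of that idea, not the authors' implementation. The cosine similarity, the similarity-weighted confidence formula, the function names `cosine` and `classify`, the toy data, and the rule of giving the rare category a smaller K are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(query, train, k_per_class):
    """Score each category with its own K (hypothetical scheme).

    train: list of (vector, label) pairs.
    k_per_class: dict mapping label -> K used when scoring that label.
    Returns (best_label, scores).
    """
    # Rank all training documents by similarity to the query, best first.
    ranked = sorted(
        ((cosine(query, vec), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = {}
    for label, k in k_per_class.items():
        top = ranked[:k]  # this category's own neighbourhood
        total = sum(sim for sim, _ in top)
        same = sum(sim for sim, lab in top if lab == label)
        # Confidence: share of similarity mass that belongs to this label.
        scores[label] = same / total if total else 0.0
    return max(scores, key=scores.get), scores


# Toy skewed training set: five "sports" documents, two "finance" documents.
train = [
    ({"ball": 1, "team": 1}, "sports"),
    ({"ball": 1, "goal": 1}, "sports"),
    ({"team": 1, "goal": 1}, "sports"),
    ({"ball": 1, "team": 1, "market": 1}, "sports"),
    ({"goal": 1, "team": 1, "market": 1}, "sports"),
    ({"stock": 1, "market": 1}, "finance"),
    ({"stock": 1, "bank": 1}, "finance"),
]

# Assumed heuristic: the rare category gets a smaller neighbourhood.
k_per_class = {"sports": 5, "finance": 2}

label, scores = classify({"stock": 1, "market": 1}, train, k_per_class)
print(label, scores)
```

Because each category is judged inside a neighbourhood sized for it, a rare category with only a few close neighbours is not outvoted simply because a large K pulls in many distant documents from the common category. How the per-category K values are actually chosen in the paper is not stated in the abstract.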

Wang Huiliang (王慧亮), Xin Yang (辛阳)

Subject: Computing Technology, Computer Technology

Keywords: Information Security; Text Categorization; KNN; Natural Language Processing

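Cite as: Wang Huiliang, Xin Yang. 一种基于独立类别特性的改进KNN文本分类算法 (An Improved KNN Text Categorization Algorithm Based on Independent Category Features) [EB/OL]. (2012-07-31) [2025-08-02]. http://www.paper.edu.cn/releasepaper/content/201207-332.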
