National Preprint Platform

Cross-Language Text Classification Using Latent Semantic Analysis and an Improved Prototype Algorithm

Chinese abstract | English abstract

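With economic globalization, more and more organizations urgently need to classify multilingual documents automatically, yet they are constrained by a shortage of labeled documents in foreign languages. Cross-language text classification uses labeled source-language documents to classify target-language documents. Traditional cross-language text classification mostly relies on translation: documents are first translated between the source and target languages, manually or by machine, and a conventional text classification method is then applied. Manual translation incurs high labor costs, while machine translation introduces many translation errors. In addition, because the marginal distributions of the source and target documents differ, the resulting topic drift problem also needs to be handled properly. In this paper, we first use latent semantic analysis to map source-language and target-language documents into a unified semantic space. Once this space is built, further documents can be mapped into it directly without translation. We then use an advanced prototype-based algorithm to exploit the information hidden in the target documents to aid cross-language text classification. In this way, semantic analysis and the mining of hidden information allow us to handle both problems properly. A series of comparative experiments on the five-language Reuters RCV1/RCV2 dataset shows that, across several language pairs, the proposed algorithm considerably improves the F-measure over the traditional translation-based approach: by 9.5% for the English-French pair and by 18.7% for the English-Spanish pair.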

With rapid development and globalization, more and more organizations have to manage large numbers of documents in different languages while lacking sufficient labeled data in the target language. Cross-language text classification (CLTC) automatically classifies documents in a target language with the help of labeled documents in a source language. Most previous work translated between the source and target languages, manually or by machine, before applying traditional text classification methods. However, manual translation is costly and machine translation introduces many errors. Moreover, the marginal distributions of the training data and test data may not be identical, so topic drift is another challenge for CLTC. In this paper, we propose an approach to CLTC that uses latent semantic analysis to build a common intermediate semantic space, so that documents in different languages can be "folded into" this space without translation. We also present an advanced prototype-based algorithm that makes full use of the information in the target documents to address the topic drift problem. Preliminary experiments on the five-language Reuters RCV1/RCV2 Multilingual collection show that our best result, on the Spanish-English language pair, is an 18.5% improvement in Macro-F over the traditional translation-based SVM.
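The abstract does not give the details of either step, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the shared space is learned by a truncated SVD of a term-document matrix whose columns stack each document with its translation (a small parallel corpus), that a monolingual document is folded in by zero-padding the other language's terms, and that the prototype step is a plain nearest-centroid classifier whose prototypes are re-estimated from the target documents. The function names `fit_lsa`, `fold_in`, and `classify_with_refinement`, the parameters `alpha` and `n_iter`, and the toy data are all hypothetical.

```python
import numpy as np

def fit_lsa(X_parallel, k):
    """Truncated SVD of a (terms x documents) matrix; returns U_k and the top-k singular values."""
    U, s, _ = np.linalg.svd(X_parallel, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(U_k, s_k, D):
    """Fold term-space column vectors D (terms x n) into the k-dim space: S_k^-1 U_k^T d."""
    return (U_k.T @ D).T / s_k  # one row per document

def normalize(Z):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)

def classify_with_refinement(Z_src, y_src, Z_tgt, n_classes, n_iter=5, alpha=0.5):
    """Nearest-prototype classification with prototypes pulled toward the target data.

    Assumed stand-in for the paper's improved prototype algorithm: start from the
    centroids of the labeled source documents, then repeatedly mix each centroid
    with the mean of the target documents currently assigned to it.
    Every class is assumed to have at least one labeled source document.
    """
    Z_src, Z_tgt = normalize(Z_src), normalize(Z_tgt)
    P_src = normalize(np.vstack([Z_src[y_src == c].mean(axis=0)
                                 for c in range(n_classes)]))
    P = P_src
    for _ in range(n_iter):
        pred = np.argmax(Z_tgt @ P.T, axis=1)
        P_new = P.copy()
        for c in range(n_classes):
            members = Z_tgt[pred == c]
            if len(members) > 0:
                P_new[c] = alpha * P_src[c] + (1 - alpha) * members.mean(axis=0)
        P = normalize(P_new)
    return np.argmax(Z_tgt @ P.T, axis=1)

# Toy data: rows 0-3 are source-language terms, rows 4-7 are target-language terms.
# Each column of X_parallel is a document stacked with its translation (assumed).
X_parallel = np.array([
    [1, 1, 0, 0, 1, 1, 0, 0],
    [2, 1, 0, 0, 1, 2, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 3, 0, 0, 3, 1],
], dtype=float).T

# Labeled source-language documents: target-language rows are zero-padded.
D_src = np.array([
    [3, 1, 0, 0, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0, 0, 0],
    [0, 0, 2, 1, 0, 0, 0, 0],
    [0, 0, 1, 4, 0, 0, 0, 0],
], dtype=float).T
y_src = np.array([0, 0, 1, 1])

# Unlabeled target-language documents: source-language rows are zero-padded.
D_tgt = np.array([
    [0, 0, 0, 0, 2, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 3],
    [0, 0, 0, 0, 0, 0, 2, 1],
    [0, 0, 0, 0, 1, 3, 0, 0],
], dtype=float).T

U_k, s_k = fit_lsa(X_parallel, k=2)
Z_src = fold_in(U_k, s_k, D_src)
Z_tgt = fold_in(U_k, s_k, D_tgt)
pred = classify_with_refinement(Z_src, y_src, Z_tgt, n_classes=2)
print(pred)
```

In this sketch no target-language document is translated at classification time: once `U_k` and `s_k` are fitted, any new document in either language is projected with `fold_in`. The weight `alpha` controls how strongly the prototypes stay tied to the source-language centroids versus drifting toward the target-language data.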

Zhao Jiang (赵江), Lan Man (兰曼)

Computing technology, computer technology; automation technology, automation equipment

Natural language processing; cross-language text classification; latent semantic analysis; prototype algorithm (prototype classifier)

Zhao Jiang, Lan Man. Cross-Language Text Classification Using Latent Semantic Analysis and an Improved Prototype Algorithm [EB/OL]. (2013-01-22) [2025-07-17]. http://www.paper.edu.cn/releasepaper/content/201301-956.
