Text Similarity Evaluation Algorithm for Structured Data of Institutional Repository
Text data sets in institutional repositories are mostly structured and discrete. To address this, the paper proposes a text similarity evaluation algorithm. It analyzes the DC (Dublin Core) metadata format to select the valid data, computes the weight of specific words in specified fields, counts the number of matches, and calculates text similarity on the basis of text-length normalization. The algorithm is validated on a manually built text test set; statistical analysis indicates that it is feasible and computes reasonable similarity values for structured, discrete text data.
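The abstract does not give the formulas, so the following is only a minimal sketch of the general idea it describes: weight each metadata field, count matching words, and normalize by text length. The field weights, the whitespace tokenizer, and the Dice-style normalization are all assumptions made for illustration, not the authors' method; Chinese text would additionally need a word segmenter.

```python
from collections import Counter

# Hypothetical weights for three Dublin Core elements; the abstract
# does not state which fields are used or what their weights are.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "description": 1.0}


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def field_similarity(a_tokens, b_tokens):
    """Count matched words and normalize by the combined field length."""
    if not a_tokens or not b_tokens:
        return 0.0
    # Multiset intersection: each word matches at most as often as it
    # appears in both token lists.
    matches = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    return 2.0 * matches / (len(a_tokens) + len(b_tokens))


def record_similarity(rec_a, rec_b, weights=FIELD_WEIGHTS):
    """Weighted average of per-field similarities for two metadata records."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in weights.items():
        a = tokenize(rec_a.get(field, ""))
        b = tokenize(rec_b.get(field, ""))
        if not a and not b:
            continue  # field missing in both records: ignore it
        total += weight * field_similarity(a, b)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Under these assumptions, two records with identical fields score 1.0, records with no shared words score 0.0, and a field that is empty in both records does not affect the result.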
颉夏青、许晋、郭芳毓、吴旭
Computing technology; computer technology
Algorithm theory; Data analysis; Weight calculation; Word matching; Text similarity; Institutional repository
颉夏青, 许晋, 郭芳毓, 吴旭. Text Similarity Evaluation Algorithm for Structured Data of Institutional Repository [EB/OL]. (2015-03-31) [2025-08-10]. http://www.paper.edu.cn/releasepaper/content/201503-430.