
Design and Implementation of Middleware for Chinese Word Segmentation


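Chinese word segmentation is a foundation of Chinese information processing and has a wide range of applications. Before settling on a Chinese word segmentation system, developers usually need to compare the segmentation efficiency and accuracy of the candidates. Because these systems are developed by different organizations without a unified standard, they differ considerably in programming language, segmentation principle, the dictionaries they depend on, functionality, and exposed interfaces. This increases both the learning cost and the development cost for developers. Moreover, once a segmentation system has been chosen, the integration code is tightly coupled to the rest of the application, so switching to another segmenter later requires extensive changes to the existing code. To lower the development and learning costs for application developers and to reduce system coupling, this paper designs and implements a middleware for Chinese word segmentation. In addition to the three high-performance, commonly used segmentation components it ships with by default, the middleware allows new segmentation components to be added. It abstracts the functions common to the various segmentation systems, hides the underlying segmentation modules, and provides a unified interface to upper-layer applications. With only simple configuration, developers can switch between segmentation modules, which greatly reduces code coupling. By calling the unified interface and performing simple configuration, the middleware can be integrated seamlessly into developers' complex application systems. Finally, a digital resource retrieval system (DCS) was built on the middleware to validate it, and the experiments show that the middleware has practical value.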

Chinese word segmentation is one of the fundamental technologies in Chinese information processing, and it is used in many fields. Application developers need to compare the efficiency and accuracy of different Chinese word segmentation systems before deciding which one to adopt. Because these systems are developed by different project groups without a common standard, they differ in many respects, such as programming language, algorithm, dictionary, functionality, and interface. These differences increase the learning cost and development cost for application developers. In addition, because of coupling, developers have to rewrite code if they want to replace one segmentation system with another. To cut development cost and reduce coupling, this paper designs and implements a middleware for Chinese word segmentation. By default, the middleware supports three Chinese word segmentation systems: ICTCLAS, Jieba, and Pychseg. It is extensible, so other Chinese word segmentation systems can be added. The middleware provides uniform, standard, high-level interfaces to upper-layer application developers and integrators, and hides the heterogeneity of the underlying segmentation systems. Application developers need only simple configuration to switch from one Chinese word segmentation system to another. The middleware can be integrated into many kinds of complex applications. DCS, a search engine for digital resources, was developed to test and verify the middleware. The results show that the middleware can be applied to actual projects.
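
The uniform interface and configuration-based switching described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `Segmenter`, `register`, `create_segmenter`, and `segment` are hypothetical, and the two toy backends merely stand in for real systems such as ICTCLAS or Jieba, whose actual APIs are not shown here.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class Segmenter(ABC):
    """Uniform interface that every backend adapter implements (hypothetical)."""

    @abstractmethod
    def cut(self, text: str) -> List[str]:
        """Split text into a list of tokens."""


class CharSegmenter(Segmenter):
    """Toy backend: one token per non-whitespace character."""

    def cut(self, text: str) -> List[str]:
        return [ch for ch in text if not ch.isspace()]


class MaxMatchSegmenter(Segmenter):
    """Toy backend: forward maximum matching over a small word list."""

    def __init__(self, words: List[str]) -> None:
        self.words = set(words)
        self.max_len = max((len(w) for w in self.words), default=1)

    def cut(self, text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(self.max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.words:
                    tokens.append(piece)
                    i += size
                    break
        return tokens


# Registry mapping a configuration name to a backend factory.
_REGISTRY: Dict[str, Callable[[], Segmenter]] = {}


def register(name: str, factory: Callable[[], Segmenter]) -> None:
    """Add a new backend without touching application code."""
    _REGISTRY[name] = factory


def create_segmenter(config: Dict[str, str]) -> Segmenter:
    """Build the backend named by the 'segmenter' key of the configuration."""
    name = config["segmenter"]
    if name not in _REGISTRY:
        raise KeyError(f"unknown segmenter: {name}")
    return _REGISTRY[name]()


def segment(text: str, config: Dict[str, str]) -> List[str]:
    """Single entry point used by the application, whatever the backend."""
    return create_segmenter(config).cut(text)


register("char", CharSegmenter)
register("maxmatch", lambda: MaxMatchSegmenter(["中文", "分词", "中间件"]))

if __name__ == "__main__":
    # Switching backends only requires changing the configuration value.
    print(segment("中文分词中间件", {"segmenter": "maxmatch"}))
    print(segment("中文分词中间件", {"segmenter": "char"}))
```

In this sketch the application only ever calls `segment`, so replacing one backend with another is a one-line configuration change, and a new backend is added by writing an adapter class and registering it.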

E Haihong, Song Junde, Zhang Jing

Computing technology, computer technology

Chinese word segmentation; Python; middleware

E Haihong, Song Junde, Zhang Jing. Design and Implementation of Middleware for Chinese Word Segmentation [EB/OL]. (2012-12-17) [2025-08-06]. http://www.paper.edu.cn/releasepaper/content/201212-371.
