National Preprint Platform

Parallel Implementation of an Improved K-means Algorithm Based on Spark

Abstract

To address the shortcomings of the K-means clustering algorithm, this paper proposes an improved K-means algorithm. A simplified silhouette coefficient serves as the evaluation criterion for choosing the value of k, and K-means++ selects the initial center points. Once k and the initial centers are set, the morphology similarity distance is used as the similarity measure to assign each data point to the cluster of its nearest center; the average silhouette coefficient is then computed to determine a suitable k. The algorithm is parallelized on Spark. Experiments on four standard datasets, covering accuracy, running time, and speedup, show that compared with the traditional K-means algorithm and the SKDK-means algorithm, the improved algorithm yields better clustering quality and shorter computation time, and exhibits good parallel performance in a multi-node cluster environment. The results indicate that the proposed algorithm effectively improves execution efficiency and parallel computing capability.
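The pipeline described in the abstract (K-means++ seeding, nearest-center assignment, and selection of k by the average simplified silhouette coefficient) can be illustrated with a minimal single-machine sketch. It is not the authors' implementation. All function names are hypothetical. Euclidean distance stands in for the morphology similarity distance, which the abstract does not define. The simplified silhouette is assumed to be the common centroid-based form s = (b - a) / max(a, b), where a is the distance to the point's own center and b the distance to the nearest other center. Spark is not used here.

```python
import math
import random

def euclidean(p, q):
    # Stand-in metric (assumption); the paper uses a morphology similarity distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeanspp_init(points, k, rng, dist=euclidean):
    # K-means++ seeding: first center uniformly at random, each later center
    # drawn with probability proportional to the squared distance to the
    # nearest center chosen so far. Assumes at least k distinct points.
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def assign(points, centers, dist=euclidean):
    # Label each point with the index of its nearest center.
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    # Move each center to the mean of its members; empty clusters keep theirs.
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def kmeans(points, k, rng, dist=euclidean, max_iter=100):
    # Lloyd iterations, stopping when the assignment no longer changes.
    centers = kmeanspp_init(points, k, rng, dist)
    labels = assign(points, centers, dist)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers, dist)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels

def simplified_silhouette(points, centers, labels, dist=euclidean):
    # Average of (b - a) / max(a, b) over all points; requires k >= 2.
    total = 0.0
    for p, lab in zip(points, labels):
        a = dist(p, centers[lab])
        b = min(dist(p, c) for j, c in enumerate(centers) if j != lab)
        total += 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, k_values, seed=0, dist=euclidean):
    # Run K-means for each candidate k and keep the k with the highest score.
    scores = {}
    for k in k_values:
        centers, labels = kmeans(points, k, random.Random(seed), dist)
        scores[k] = simplified_silhouette(points, centers, labels, dist)
    return max(scores, key=scores.get), scores
```

Calling `choose_k(points, range(2, 6))` on a list of coordinate tuples returns the selected k together with the per-k scores. In a distributed setting the assignment step is naturally a map over points and the center update a per-cluster aggregation, which is one plausible way such a loop could be expressed on Spark; the abstract does not describe the paper's actual parallelization.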

段文影、卜秋瑾、段隆振、杜佳颖

10.12074/201812.00114V1

Computing technology, computer technology

clustering algorithm; simplified silhouette coefficient; morphology similarity distance; similarity measure

段文影, 卜秋瑾, 段隆振, 杜佳颖. Parallel Implementation of an Improved K-means Algorithm Based on Spark [EB/OL]. (2018-12-13)[2025-08-02]. https://chinaxiv.org/abs/201812.00114.
