
An Improved K-means Algorithm Based on the Spark Platform

The advanced K-means based on Spark


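Abstract (translated from the Chinese)

K-means is a classic clustering algorithm. Classical K-means has two shortcomings: the number of clusters K and the initial cluster centers must be specified manually, and the serial implementation performs poorly on massive data sets. To address both, this paper proposes a canopy-Kmeans algorithm. It runs the canopy algorithm as a preprocessing step for K-means to obtain the initial cluster centers and the value of K, and it parallelizes the algorithm on the Spark framework, making full use of Spark's in-memory computing to improve clustering efficiency. Experiments show that canopy-Kmeans improves on both the traditional serial K-means algorithm and the unmodified parallel algorithm in accuracy and efficiency.

Abstract (authors' English version)

Aiming at the problems that the number of clusters K and the initial cluster centers in the classical K-means algorithm need to be specified manually, and that the classical serial K-means algorithm performs poorly in the face of massive data, a canopy-Kmeans algorithm is proposed. The algorithm introduces the canopy algorithm as a pre-algorithm of the K-means algorithm to obtain the initial cluster centers and the value of K, and combines it with the Spark framework to parallelize the algorithm. It takes full advantage of Spark's in-memory computing and improves clustering efficiency. Experiments show that the canopy-Kmeans algorithm has higher accuracy and efficiency than the traditional serial K-means algorithm and the unmodified parallel algorithm.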
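The idea of using canopy as a pre-algorithm can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' Spark implementation, and it does not parallelize anything; it only shows how a canopy pass might yield both K and the initial centers, which are then handed to ordinary K-means. The function names, the tight threshold `t2`, and the toy data are assumptions made for illustration. The loose canopy threshold (usually called T1) is omitted because it affects canopy membership only, not which points become centers.

```python
import math


def canopy_centers(points, t2):
    """Pick canopy centers: take a point as a center, then drop every
    remaining candidate that lies within the tight threshold t2 of it."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return centers


def nearest(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(group) for coords in zip(*group))


def kmeans(points, initial_centers, max_iter=100):
    """Standard Lloyd iterations, seeded with the given centers."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster happens to be empty.
        updated = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers


# Toy data: two well-separated groups (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = canopy_centers(points, t2=3.0)  # K is simply len(seeds)
final_centers = kmeans(points, seeds)
```

In this sketch the choice of `t2` replaces the choice of K: a smaller threshold produces more canopies and therefore more clusters. On Spark, the per-point assignment step is the natural candidate for distribution across partitions, but how the paper organizes that step is not stated in the abstract.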

Authors: Yan Meng (闫萌), Zou Junwei (邹俊伟)


Subject classification: computing technology, computer technology

Keywords: clustering algorithm, K-means algorithm, parallelization, Spark

Recommended citation: Yan Meng, Zou Junwei. 基于spark平台的K-means改进算法 (An improved K-means algorithm based on the Spark platform) [EB/OL]. (2017-12-05)[2025-08-16]. http://www.paper.edu.cn/releasepaper/content/201712-50.
