An Improved K-means Algorithm Based on the Spark Platform
K-means is a classic clustering algorithm. The classical K-means algorithm has two shortcomings: the number of clusters K and the initial cluster centers must be specified manually, and the serial implementation performs poorly on massive data. To address both, a canopy-Kmeans algorithm is proposed. It runs the canopy algorithm as a preprocessing step for K-means to obtain the initial cluster centers and the value of K, and it is parallelized on the Spark framework so that Spark's in-memory computing improves clustering efficiency. Experiments show that canopy-Kmeans achieves higher accuracy and efficiency than both the traditional serial K-means algorithm and the unmodified parallel algorithm.
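The canopy-then-K-means idea described in the abstract can be illustrated with a minimal serial sketch. This is not the authors' implementation: the function names, the single distance threshold `t2`, and the use of Euclidean distance are assumptions made for illustration, and the Spark parallelization is not shown. Seed selection here keeps only the tight canopy threshold, since a point becomes a new center exactly when it lies farther than `t2` from every center chosen so far.

```python
import math


def canopy_centers(points, t2):
    """Choose canopy centers in input order (hypothetical helper).

    A point becomes a new center when it is farther than t2 from
    every center picked so far. The number of centers returned
    serves as K for the K-means step.
    """
    centers = []
    for p in points:
        if all(math.dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


def kmeans(points, centers, max_iter=20):
    """Plain Lloyd iterations started from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)),
                      key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(x) / len(members) for x in zip(*members))
            if members else c
            for members, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def canopy_kmeans(points, t2, max_iter=20):
    """Canopy seeding followed by K-means; K is never given by hand."""
    seeds = canopy_centers(points, t2)
    return kmeans(points, seeds, max_iter)
```

With two well-separated groups of points and a threshold between the within-group and between-group distances, `canopy_centers` returns one seed per group, so K-means starts with K = 2 without that value being supplied. The result still depends on the choice of `t2`, which the abstract does not specify.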
Yan Meng (闫萌), Zou Junwei (邹俊伟)
Computing technology, computer technology
clustering algorithm, K-means algorithm, parallelization, Spark
Yan Meng, Zou Junwei. An improved K-means algorithm based on the Spark platform [EB/OL]. (2017-12-05) [2025-08-16]. http://www.paper.edu.cn/releasepaper/content/201712-50.