National Preprint Platform

Optimization and Realization of K-means Clustering Algorithm Based on Hadoop


The traditional K-Means clustering algorithm is poorly suited to mining massive data sets and is highly sensitive to outliers. To address both problems, this paper combines the Hadoop cloud computing platform with the MapReduce parallel programming framework and borrows from the K-Medoids clustering algorithm, which is insensitive to outliers, to propose an improved parallel K-Means clustering algorithm on Hadoop, named HK-Means. The map function computes the distance from each data record to every cluster center and assigns the record to a cluster; the reduce function updates the cluster centers. Experiments show that, compared with the traditional serial algorithm, HK-Means reduces time complexity and exhibits good stability.
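The division of work between the map and reduce functions can be illustrated with a minimal single-process sketch. It is not the authors' Hadoop implementation: the function names (`map_assign`, `reduce_mean`, `reduce_medoid`, `iterate`) are hypothetical, Euclidean distance is assumed, and the abstract does not say how HK-Means borrows from K-Medoids, so the medoid-style reducer is only one plausible reading of that idea.

```python
import math
from collections import defaultdict


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)),
                  key=lambda i: math.dist(point, centers[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step (plain K-Means): new center is the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def reduce_medoid(points):
    """Reduce step (K-Medoids-style, an assumption, not the paper's rule):
    choose the member with the smallest total distance to the others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


def iterate(data, centers, reducer=reduce_mean):
    """One iteration: map every record, group by cluster key, reduce each group."""
    groups = defaultdict(list)
    for point in data:
        key, value = map_assign(point, centers)
        groups[key].append(value)
    # A center that received no points keeps its previous position.
    return [reducer(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))]
```

Calling `iterate` repeatedly until the centers stop moving gives the usual K-Means loop. On the one-dimensional-style data `[(0, 0), (1, 0), (2, 0), (3, 0), (100, 0)]` with a single center, the mean reducer is pulled toward the outlier at `(100, 0)`, while the medoid-style reducer stays on an actual data point inside the dense group, which is the kind of outlier robustness the abstract attributes to K-Medoids.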

Chen Ping (陈萍), He Jianwei (何健伟)

Computing technology, computer technology

K-Means algorithm, big data, Hadoop, parallel

Chen Ping, He Jianwei. Optimization and Realization of K-means Clustering Algorithm Based on Hadoop [EB/OL]. (2014-12-18)[2025-05-19]. http://www.paper.edu.cn/releasepaper/content/201412-521.
