Optimization of the Two-Table Equi-Join Process Based on Spark
Equi-joins between tables are among the most common operations in statistical analysis queries, but they are expensive, and equi-joins between large tables in a big data environment are even less efficient. To address this problem, this paper proposes a method for optimizing the two-table equi-join process on Spark. First, a Bloom filter is built according to the value-density characteristics of the data (big data typically has low value density) and used to filter the tables. Second, combining the advantages of Semi-Join and Partition Join, the filtered table on one side is split using a greedy algorithm. Finally, the resulting subsets are joined, so that the join of two large tables is converted into a staged series of joins between two small tables. Cost analysis and experimental results show that, compared with existing Spark-based join operations, the proposed algorithm not only improves performance but also suffers less loss of efficiency when data skew occurs.
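The three stages described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than against the Spark API, and it is not the authors' implementation: the names `BloomFilter`, `greedy_split`, and `staged_equi_join` are hypothetical, the filter size and hash count are arbitrary, and the splitting rule (assign the largest remaining key group to the currently smallest subset) is an assumed stand-in, since the abstract does not state the paper's actual greedy criterion.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Small Bloom filter; sizes are illustrative assumptions, not the paper's."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def greedy_split(rows, key_of, num_parts):
    """Assumed greedy rule: place each key group, largest first,
    into the subset that currently holds the fewest rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_of(row)].append(row)
    parts = [[] for _ in range(num_parts)]
    for _, group in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
        min(parts, key=len).extend(group)
    return parts


def staged_equi_join(left, right, key_of=lambda row: row[0], num_parts=2):
    # Stage 1: filter each table with a Bloom filter built from the other's keys.
    left_keys = BloomFilter()
    for row in left:
        left_keys.add(key_of(row))
    right_kept = [row for row in right if left_keys.might_contain(key_of(row))]

    right_keys = BloomFilter()
    for row in right_kept:
        right_keys.add(key_of(row))
    left_kept = [row for row in left if right_keys.might_contain(key_of(row))]

    # Stage 2: split the filtered left table into balanced subsets.
    # Stage 3: join each subset with the filtered right table via a hash index.
    result = []
    for part in greedy_split(left_kept, key_of, num_parts):
        index = defaultdict(list)
        for row in part:
            index[key_of(row)].append(row)
        for r in right_kept:
            for l in index.get(key_of(r), []):
                result.append((l, r))
    return result
```

Because the final stage compares exact keys, Bloom filter false positives only cost extra work and never produce incorrect join results. Keeping every row of a key in the same subset means a heavily repeated key still lands in one subset, so a real skew-aware variant would likely need to split such groups further.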
Zheng Yanbin (郑延斌), Zhang Zidong (张子栋)
Computing technology, computer technology
Spark; equi-join; big data; optimization; splitting
Zheng Yanbin, Zhang Zidong. Optimization of the Two-Table Equi-Join Process Based on Spark [EB/OL]. (2018-05-20) [2025-05-12]. https://chinaxiv.org/abs/201805.00298.