National Preprint Platform

Research on Time Series Reconstruction Method of Massive Astronomical Catalogues Based on Spark Distributed Framework

Abstract

Time series reconstruction is a crucial data processing step in time-domain astronomy and the foundation for fitting light curves and conducting time-domain analysis. For many wide-field time-domain surveys, this computation must be completed within a single exposure cycle. With the rapid growth of astronomical data, existing processing methods struggle to meet the accuracy and efficiency requirements of time series reconstruction at the same time. Spark, a general-purpose in-memory distributed computing framework, has the potential to improve the efficiency of this process, but applying it directly often runs into problems. MapReduce-style distributed models such as Hadoop and Spark require the tasks on different cluster nodes to be relatively independent, with little data transferred between nodes during execution; otherwise, frequent communication becomes an efficiency bottleneck. Because of the boundary problem in cross-matching, however, newly added data at block boundaries must inevitably be transmitted, which severely restricts the concurrency of the model and lowers the speedup achieved in practice. We therefore propose a non-blocking asynchronous execution flow in which each distributed process continuously processes the data of its own independent sky region. The additional identification (cross-match) tasks that newly added objects at block edges create for other nodes are appended later in batches, and the way they are appended is chosen according to the relative progress of the processes. This ensures that no identification calculation is omitted, improving concurrency while preserving the accuracy of the algorithm. In addition, different join strategies between two tables are studied from both theoretical and experimental perspectives, and a join-free strategy is proposed. Finally, the design of an efficient time series reconstruction system based on the Spark distributed framework validates this work. Experimental results show a clear efficiency improvement over previous research, laying a solid foundation for the analysis of astronomical time series data in time-domain astronomy.
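
The abstract does not spell out how the deferred boundary handling works, so the following is only a minimal single-process sketch of the general idea, not the authors' Spark implementation. It assumes declination zones of fixed height, a flat-sky separation that ignores RA wrap-around, and hypothetical names throughout (`ZONE_HEIGHT_DEG`, `MATCH_RADIUS_DEG`, `process_zone`, `flush_deferred`). Each zone matches its own detections without consulting its neighbours. Unmatched detections close to a zone edge are queued for the neighbouring zone and resolved later in a batch. The progress-dependent choice of append strategy mentioned in the abstract is not modelled here.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 1.0    # assumed height of one declination zone
MATCH_RADIUS_DEG = 0.01  # assumed cross-match radius


def zone_of(dec):
    """Index of the declination zone containing `dec` (degrees)."""
    return math.floor((dec + 90.0) / ZONE_HEIGHT_DEG)


def separation_deg(a, b):
    """Flat-sky separation of two (ra, dec) points; ignores RA wrap-around."""
    d_ra = (a[0] - b[0]) * math.cos(math.radians(0.5 * (a[1] + b[1])))
    return math.hypot(d_ra, a[1] - b[1])


def find_match(objects, pos):
    """First known object within the match radius of `pos`, or None."""
    for obj in objects:
        if separation_deg(obj["pos"], pos) <= MATCH_RADIUS_DEG:
            return obj
    return None


def neighbour_zone(dec):
    """Zone across the nearest edge if `dec` is within the match radius of it."""
    zone = zone_of(dec)
    lower_edge = zone * ZONE_HEIGHT_DEG - 90.0
    if dec - lower_edge < MATCH_RADIUS_DEG:
        return zone - 1
    if lower_edge + ZONE_HEIGHT_DEG - dec < MATCH_RADIUS_DEG:
        return zone + 1
    return None


def process_zone(zone, epoch, detections, catalog, deferred):
    """Match one exposure's detections for `zone` without waiting on neighbours."""
    for ra, dec, mag in detections:
        obj = find_match(catalog[zone], (ra, dec))
        other = neighbour_zone(dec)
        if obj is not None:
            obj["curve"].append((epoch, mag))
        elif other is not None:
            # Edge case: the object may live in the neighbouring zone.
            # Queue the task instead of blocking on that zone's worker.
            deferred[other].append((epoch, ra, dec, mag, zone))
        else:
            catalog[zone].append({"pos": (ra, dec), "curve": [(epoch, mag)]})


def flush_deferred(zone, catalog, deferred):
    """Resolve, in one batch, the identification tasks queued for `zone`."""
    tasks, deferred[zone] = deferred[zone], []
    for epoch, ra, dec, mag, home in sorted(tasks):
        pos = (ra, dec)
        obj = find_match(catalog[zone], pos) or find_match(catalog[home], pos)
        if obj is None:
            # Still unmatched: register a new object in the originating zone.
            obj = {"pos": pos, "curve": []}
            catalog[home].append(obj)
        obj["curve"].append((epoch, mag))
        obj["curve"].sort()  # deferred points may arrive out of epoch order


# Usage: one source jitters across the edge at dec = 0 (zones 89 and 90).
catalog = defaultdict(list)
deferred = defaultdict(list)
process_zone(89, 1, [(10.0, -0.002, 15.0)], catalog, deferred)
process_zone(90, 2, [(10.0, 0.003, 15.1), (50.0, 0.5, 12.0)], catalog, deferred)
flush_deferred(90, catalog, deferred)
flush_deferred(89, catalog, deferred)
```

In this toy run the edge source ends up as a single object with a two-point light curve, whichever zone is flushed first. In a real Spark job the queues would have to be exchanged between executors, which is exactly the cross-node traffic the abstract seeks to minimise.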

Fan Dongwei (樊东卫), Cui Chenzhou (崔辰州), Chen Yarui (陈亚瑞), Quan Wenli (权文利), Zhao Qing (赵青)

10.12074/202403.00308V1

Astronomy; Computing Technology, Computer Technology

time domain astronomy; cross-match calculation; time series reconstruction; distributed computation; Spark

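Fan Dongwei, Cui Chenzhou, Chen Yarui, Quan Wenli, Zhao Qing. Research on Time Series Reconstruction Method of Massive Astronomical Catalogues Based on Spark Distributed Framework [EB/OL]. (2024-03-26) [2025-06-07]. https://chinaxiv.org/abs/202403.00308.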
