Realising Data-Centric Scientific Workflows with Provenance-Capturing on Data Lakes

Since James Dixon introduced them in 2010, data lakes have attracted growing attention, driven by the promise of high reusability of the stored data thanks to schema-on-read semantics. Building on this idea, the literature has discussed several additional requirements to improve the general usability of the concept, such as a central metadata catalog that includes all provenance information, overarching data governance, and integration with (high-performance) processing capabilities. Although the need for both a logical and a physical organisation of data lakes to meet those requirements is widely recognised, no concrete guidelines have yet been provided. The most common architecture implementing this conceptual organisation is the zone architecture, in which data is assigned to a zone according to its degree of processing. This paper discusses how FAIR Digital Objects can be used in a novel approach to organise a data lake by data type instead of by zone, how they can abstract the physical implementation, and how they enable generic and portable processing capabilities through a provenance-based approach.

Philipp Wieder, Hendrik Nolte

DOI: 10.12074/202211.00423V1

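Subjects: Information science and information technology; Natural science research methods; Computing technology and computer technology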

Keywords: Data lake; Provenance; Workflows; FAIR Digital Objects; CWFR

Philipp Wieder, Hendrik Nolte. Realising Data-Centric Scientific Workflows with Provenance-Capturing on Data Lakes[EB/OL]. (2022-11-28)[2025-05-07]. https://chinaxiv.org/abs/202211.00423.
