Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging
With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effects of linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy to optimize performance across these tasks.
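The abstract does not say which merging method is used. As a purely illustrative sketch, one common and simple form of model merging is linear interpolation of the parameters of two models that share an architecture. The names `merge_linear`, `clir_model`, `mono_model`, and `alpha`, as well as the toy weights, are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical stand-in for real model weights: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def merge_linear(a: StateDict, b: StateDict, alpha: float = 0.5) -> StateDict:
    """Return alpha * a + (1 - alpha) * b, computed parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if a.keys() != b.keys():
        raise ValueError("models must share the same parameter names")
    merged: StateDict = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])
        ]
    return merged


# Toy weights (assumed): one model tuned for cross-lingual retrieval,
# one tuned for mono-lingual retrieval.
clir_model = {"encoder.w": [1.0, 2.0], "encoder.b": [0.0]}
mono_model = {"encoder.w": [3.0, 4.0], "encoder.b": [1.0]}

merged = merge_linear(clir_model, mono_model, alpha=0.5)
print(merged)
```

In this sketch, `alpha` controls how much weight the first model receives. In practice the same interpolation would be applied to real checkpoint tensors rather than Python lists, and the mixing weight would be chosen on held-out retrieval data.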
Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Heuiseok Lim
Commonly used foreign languages; Altaic languages (Turkic-Mongolic-Tungusic)
Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Heuiseok Lim. Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging [EB/OL]. (2025-07-11) [2025-07-25]. https://arxiv.org/abs/2507.08480.