
Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting

Source: arXiv
Abstract

Efficient text classification is essential for handling the increasing volume of academic publications. This study explores the use of pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, and BlueBERT, fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. To enhance performance, we augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with WoS-46985's main classes. PLMs predict labels for this unlabeled data, and a hard-voting strategy combines predictions for improved accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy, especially in specialized domains. Domain-specific models like SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These findings underscore the efficacy of dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques in creating robust and scalable solutions for automated academic text classification.
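The hard-voting step described above can be sketched in a few lines: each fine-tuned model emits one predicted label per document, and the majority label across models is kept. This is a minimal illustration, not the authors' implementation; the model names and category labels below are illustrative placeholders.

```python
from collections import Counter

def hard_vote(predictions):
    """Combine per-model label predictions by majority (hard) vote.

    predictions: a list of lists, one inner list of predicted labels
    per model, all aligned to the same sequence of documents.
    """
    combined = []
    for labels in zip(*predictions):  # labels for one document across models
        winner, _count = Counter(labels).most_common(1)[0]
        combined.append(winner)
    return combined

# Hypothetical per-model predictions for four documents:
bert     = ["CS", "Medical", "ECE", "CS"]
scibert  = ["CS", "Medical", "CS", "CS"]
biobert  = ["Medical", "Medical", "ECE", "CS"]
bluebert = ["CS", "Biochemistry", "ECE", "CS"]

print(hard_vote([bert, scibert, biobert, bluebert]))
# → ['CS', 'Medical', 'ECE', 'CS']
```

Note that `Counter.most_common` breaks ties by insertion order; a production version would need an explicit tie-breaking rule (e.g. preferring the most confident model).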

Zhyar Rzgar K Rostam, Gábor Kertész

DOI: 10.1109/SAMI63904.2025.10883153

Subjects: research methods in the natural sciences; information science and information technology; computing and computer technology

Zhyar Rzgar K Rostam, Gábor Kertész. Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting [EB/OL]. (2025-04-26) [2025-05-25]. https://arxiv.org/abs/2504.19021.
