Improving the Efficiency of Long Document Classification using Sentence Ranking Approach

Abstract

Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.
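The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact scoring formula, so the weighted sum of max-normalized TF-IDF score and sentence length, the `alpha` weight, the whitespace tokenizer, and the function names `split_sentences`, `tokenize`, and `rank_and_select` are all assumptions made for illustration. Each sentence is treated as its own "document" when computing IDF, and selected sentences are returned in their original order.

```python
import math
import re
from collections import Counter

PUNCT = ".,!?;:\"'()\u0964"  # includes the Devanagari danda


def split_sentences(text):
    """Naive splitter on ., !, ? and the Devanagari danda (assumption)."""
    parts = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Whitespace tokenizer with punctuation stripped (assumption)."""
    tokens = (w.strip(PUNCT) for w in sentence.lower().split())
    return [t for t in tokens if t]


def rank_and_select(text, k=None, ratio=None, alpha=0.7):
    """Keep the top-scoring sentences, either a fixed count `k` or a
    fraction `ratio` of all sentences (default 0.5 if neither is given).

    Score = alpha * normalized TF-IDF sum + (1 - alpha) * normalized length.
    This combination is a hypothetical stand-in for the paper's
    enhanced scoring strategy.
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)

    # Document frequency: number of sentences containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    raw = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF: log((1 + n) / (1 + df)) + 1
        raw.append(sum(c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                       for t, c in tf.items()))

    lengths = [len(toks) for toks in tokenized]
    max_raw = max(raw) or 1.0
    max_len = max(lengths) or 1
    combined = [alpha * r / max_raw + (1 - alpha) * l / max_len
                for r, l in zip(raw, lengths)]

    if k is None:
        k = max(1, math.ceil((ratio if ratio is not None else 0.5) * n))
    k = min(k, n)

    top = sorted(range(n), key=lambda i: combined[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore document order


if __name__ == "__main__":
    doc = ("The council approved the new metro budget after a long debate. "
           "It rained. "
           "Officials said construction of the metro line will begin in March. "
           "Thanks.")
    for sentence in rank_and_select(doc, k=2):
        print(sentence)
```

The reduced text would then be passed to the classifier in place of the full document. Fixed-count selection (`k`) bounds the input size directly, while percentage-based selection (`ratio`) scales with document length.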

Authors: Prathamesh Kokate, Mitali Sarnaik, Manavi Khopade, Raviraj Joshi

Subject classification: Austroasiatic languages

Recommended citation: Prathamesh Kokate, Mitali Sarnaik, Manavi Khopade, Raviraj Joshi. Improving the Efficiency of Long Document Classification using Sentence Ranking Approach [EB/OL]. (2025-06-22) [2025-06-29]. https://arxiv.org/abs/2506.07248.
