GLoRE: Evaluating Logical Reasoning of Large Language Models
Large language models (LLMs) have shown significant general language understanding abilities. However, few attempts have been made to assess the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a General Logical Reasoning Evaluation platform that not only consolidates diverse datasets but also standardizes them into a unified format suitable for evaluating large language models in both zero-shot and few-shot scenarios. Our experimental results show that, compared with the performance of humans and supervised fine-tuned models, the logical reasoning capabilities of large reasoning models such as OpenAI's o1-mini, DeepSeek-R1, and QwQ-32B have improved remarkably, with QwQ-32B achieving the highest benchmark performance to date. GLoRE is designed as a living project that continuously integrates new datasets and models, facilitating robust and comparative assessments of model performance across both commercial models and the Hugging Face community.
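As a concrete illustration of what a standardized logical reasoning evaluation might look like (this is a hedged sketch, not the paper's actual schema), the snippet below defines a hypothetical unified multiple-choice record and renders it as a zero-shot prompt; the field names and the `build_zero_shot_prompt` helper are illustrative assumptions.

```python
# A minimal sketch of a unified multiple-choice record and a zero-shot
# prompt builder in the spirit of GLoRE's standardized format. All field
# names and the helper function are hypothetical, not GLoRE's actual API.
from dataclasses import dataclass
from typing import List


@dataclass
class LogicalReasoningExample:
    context: str        # passage containing the premises
    question: str       # the logical reasoning question
    options: List[str]  # candidate answers
    answer: str         # gold label, e.g. "A"


def build_zero_shot_prompt(ex: LogicalReasoningExample) -> str:
    """Render one standardized example as a zero-shot prompt string."""
    letters = "ABCD"
    lines = [ex.context, "", f"Question: {ex.question}"]
    for letter, option in zip(letters, ex.options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


example = LogicalReasoningExample(
    context="All birds can fly. Penguins are birds.",
    question="Which statement follows logically from the premises?",
    options=["Penguins can fly.", "Penguins cannot fly.",
             "Some birds are penguins.", "No conclusion follows."],
    answer="A",
)
print(build_zero_shot_prompt(example))
```

A few-shot variant would simply prepend several solved records in the same rendered format before the test item, which is what makes a single unified schema convenient for both evaluation settings.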
Hanmeng Liu, Zhiyang Teng, Ruoxi Ning, Yiran Ding, Xiulai Li, Xiaozhang Liu, Yue Zhang
Computing Technology, Computer Technology
Hanmeng Liu, Zhiyang Teng, Ruoxi Ning, Yiran Ding, Xiulai Li, Xiaozhang Liu, Yue Zhang. GLoRE: Evaluating Logical Reasoning of Large Language Models [EB/OL]. (2023-10-13) [2025-06-27]. https://arxiv.org/abs/2310.09107.