Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

Source: arXiv
Abstract

As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at https://github.com/Mihir3009/Multi-LogiEval.

Neeraj Varshney, Mutsumi Nakamura, Chitta Baral, Mohith Kulkarni, Aashna Budhiraja, Mihir Parmar, Nisarg Patel

Computing technology; computer technology

Neeraj Varshney, Mutsumi Nakamura, Chitta Baral, Mohith Kulkarni, Aashna Budhiraja, Mihir Parmar, Nisarg Patel. Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models [EB/OL]. (2024-06-24) [2025-08-02]. https://arxiv.org/abs/2406.17169.
