Evaluation of Claude 3 Opus and Kimi in peer review for biomedical journals
[Purpose] This study aimed to evaluate the capabilities of Claude 3 Opus and Kimi in peer review for biomedical journals. [Methods] Peer reviews were conducted using Claude 3 Opus and Kimi on 29 papers from China Oncology that had reached the final review meeting before publication, and the papers were also examined against the disclosure checklist in the Biomedical Research Reporting Guidelines. All authors consented to the use of AIGC for peer review. The AIGC peer review results were scored by experts on a 5-point Likert scale. Count data were analyzed using one-way ANOVA or Fisher's exact probability test, and multiple groups of paired measurement data were analyzed using the Friedman M test. Sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of the two AIGC models' checklist assessments were calculated using the four-fold table method, and ROC curves were plotted to assess their predictive capability. [Findings] Of the 29 articles, the final review meeting accepted 6 for publication, accepted 15 after revision, and rejected 8. Claude 3 Opus's peer review conclusions recommended publication for 19 articles and publication after revision for 10; Kimi's conclusions recommended publication for 9 articles, publication after revision for 16, and rejection for 4. The Friedman M test showed no statistically significant difference between Kimi's peer review conclusions and the experts' conclusions at the final review meeting (M=0.241, adjusted P=1.000). The Likert scale results indicated that the experts rated Kimi's peer review results higher than those of Claude 3 Opus (3.85±0.47 vs 3.48±0.73, F=10.017, P=0.002). In reviewing against the Biomedical Research Reporting Guidelines disclosure checklist, Claude 3 Opus achieved an accuracy of 77.5%, a sensitivity of 76.9%, and a specificity of 64.0%; Kimi achieved an accuracy of 75.2%, a sensitivity of 77.5%, and a specificity of 70.1%. ROC curve analysis showed areas under the curve of 0.818 for Claude 3 Opus and 0.841 for Kimi, indicating good predictive capability. Testing showed no statistically significant difference between Kimi's checklist review results and the responsible editor's review results (M=-0.152, adjusted P=0.061). [Conclusions] Claude 3 Opus and Kimi showed good capability in reviewing the disclosure checklist of biomedical research reports, with high consistency with the responsible editor's reviews. However, the two AIGC models have not yet reached the level of expert peer review, exhibiting problems such as: (1) inaccuracy in generated content; (2) lack of personalization in generated content; (3) insufficient extrapolation from external content; (4) ambiguity in review comments; (5) overly coarse-grained generated content; and (6) bias toward positive evaluations in review conclusions. Despite these issues, they demonstrate some potential for application. To further improve their effectiveness, high-quality AIGC-specific tools should be developed to assist peer review in biomedical journals, and medical experts, medical editors, and AIGC developers should work together to establish relevant standards, ensure data security and quality, increase transparency, reduce review bias, adhere to publishing ethics, and establish effective supervision and feedback mechanisms to ensure the accuracy of AIGC in peer review for biomedical journals.
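For readers who want to see how the checklist-audit metrics are derived, the following is a minimal sketch of the four-fold (2×2) table calculations named in the Methods. The counts are hypothetical placeholders, since the abstract reports only the derived percentages; note also that with a single binary cutoff the ROC reduces to one point, where AUC = (sensitivity + specificity) / 2, so the full curves in the paper presumably come from graded scores.

```python
# Minimal sketch of the four-fold (2x2) table metrics used in the study.
# The counts below are HYPOTHETICAL placeholders, not the paper's data.

def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, PPV, NPV, and accuracy from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "specificity": tn / (tn + fp),                 # true negative rate
        "ppv": tp / (tp + fp),                         # positive predictive value
        "npv": tn / (tn + fn),                         # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical example: AIGC checklist judgments vs. the editor's judgments.
print(confusion_metrics(tp=40, fp=9, fn=12, tn=16))
```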
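Likewise, a minimal sketch of the Friedman test used for the multi-group paired comparison of review conclusions, via SciPy's `stats.friedmanchisquare`. The rating codes and values below are hypothetical (1 = accept, 2 = revise, 3 = reject), and the adjusted P values reported in the paper come from post-hoc pairwise comparisons beyond this omnibus test.

```python
# Minimal sketch of the Friedman test for paired ordinal conclusions from
# three raters (final-review experts, Claude 3 Opus, Kimi) on the same papers.
# The data below are HYPOTHETICAL, not the study's ratings.
from scipy import stats

experts = [1, 2, 2, 3, 2, 1, 3, 2]
claude  = [1, 1, 2, 2, 1, 1, 2, 1]
kimi    = [1, 2, 2, 3, 2, 2, 3, 2]

statistic, p_value = stats.friedmanchisquare(experts, claude, kimi)
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.3f}")
```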
Ni Ming
Scientific Communication and Knowledge Dissemination
Biomedical journals; Generative artificial intelligence; Claude 3 Opus; Kimi; Peer review
Ni Ming. Evaluation of Claude 3 Opus and Kimi in peer review for biomedical journals [EB/OL]. (2025-01-08) [2025-01-15]. https://chinaxiv.org/abs/202412.00331.