Information Quality and Readability Analysis of Large Language Model Responses to a Douyin Index-Based Thyroid Cancer Question Set
Background: Large language models (LLMs) are gaining public familiarity and are increasingly adopted in healthcare contexts. Thyroid cancer is a common malignancy in China, where patients express substantial unmet needs for evidence-based disease information. Nevertheless, no studies have assessed the quality and readability of LLM-generated responses regarding thyroid cancer in the Chinese context.
Objective: To evaluate and compare the quality and readability of responses generated by domestic LLMs to thyroid cancer-related queries.
Methods: The Douyin Index was used to identify a set of 25 questions pertaining to thyroid cancer. Response texts were generated using DeepSeek (DeepSeek-R1-0120), Qwen (qwen-max-2025-01-25), and GLM (GLM-4Plus). Cosine similarity was used to measure the similarity between texts generated at different time points, thereby assessing the stability of each model. Information quality was assessed with the modified Health Information Quality Assessment Tool (mDISCERN), and readability was evaluated with the Chinese Readability Formula. Differences between models in the quality and stability of response texts were explored using cluster heatmaps, principal component analysis (PCA), Friedman tests, and signed-rank tests. Pearson correlation analysis was used to examine the relationship between information quality and readability.
Results: In the text similarity evaluation, 12% of DeepSeek's responses were moderately similar and 88% were highly similar, while 100% of the two responses from Qwen and from GLM were highly similar. A comparative analysis of information quality and readability across the three models showed statistically significant differences (P<0.001).
Specifically, DeepSeek demonstrated superior information quality, as indicated by a significant chi-squared test result (Z=35.396, P<0.001). However, its readability was comparatively lower (R=7.525±1.006). Qwen and GLM exhibited comparable information quality, with GLM performing better on question clusters 2 and 3 and Qwen performing better on question cluster 1. The overall correlation between information quality and readability was negative (r=-0.370, P=0.010).
Conclusion: LLMs in China have significant potential to provide essential health education to patients with thyroid cancer. However, concerns have been raised regarding inaccuracies in the generated content and the occurrence of AI hallucinations. When patients use LLMs to obtain health information, they should weigh the responses from different platforms together with their doctor's advice. On the model side, it is necessary to balance the professional rigor and accessibility of the information and to establish a medical content safety review mechanism that ensures its accuracy and professionalism.
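The stability check described in the Methods compares two responses to the same question with cosine similarity. The abstract does not say how the texts were turned into vectors, so the following is only a minimal sketch that assumes character-bigram counts (a common choice for unsegmented Chinese text). The function names and the two sample responses are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace.

    Assumption: character bigrams stand in for whatever text
    representation the study actually used.
    """
    compact = "".join(text.split())
    return Counter(compact[i:i + n] for i in range(len(compact) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical responses to the same question at two time points.
first = "甲状腺癌术后需要定期复查甲状腺功能"
second = "甲状腺癌术后应当定期复查甲状腺功能"

score = cosine_similarity(char_ngrams(first), char_ngrams(second))
print(f"cosine similarity: {score:.3f}")
```

A score near 1 means the two responses share almost all of their bigrams, and a score near 0 means they share almost none. The cut-offs that separate "moderately" from "highly" similar texts are not given in the abstract, so none are assumed here.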
Xue Mengyuan (薛梦元), Peng Yinghua (彭映华), Ning Yanting (宁艳婷), Ma Heng (马恒), Zhao Bohui (赵博慧), Huang Yingtong (黄映彤)
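Department of Head and Neck Surgery, National Cancer Center / National Clinical Research Center for Cancer / Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, Guangdong 518116, China
Department of Head and Neck Surgery, National Cancer Center / National Clinical Research Center for Cancer / Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, Guangdong 518116, China
Department of Nursing, National Cancer Center / National Clinical Research Center for Cancer / Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, Guangdong 518116, China
Department of Head and Neck Surgery, National Cancer Center / National Clinical Research Center for Cancer / Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, Guangdong 518116, China
Department of Thoracic Surgery, National Cancer Center / National Clinical Research Center for Cancer / Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
Department of Head and Neck Surgery, National Cancer Center / National Clinical Research Center for Cancer / Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, Guangdong 518116, China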
Medical research methods; medical and health theory; oncology
Large language models; thyroid cancer; information quality; readability analysis; medical artificial intelligence
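Xue Mengyuan, Peng Yinghua, Ning Yanting, Ma Heng, Zhao Bohui, Huang Yingtong. Information Quality and Readability Analysis of Large Language Model Responses to a Douyin Index-Based Thyroid Cancer Question Set [EB/OL]. (2025-07-14)[2025-07-18]. https://chinaxiv.org/abs/202507.00153.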