
Using Large Language Models to Assess Teachers' Pedagogical Content Knowledge

Source: arXiv

Abstract

Assessing teachers' pedagogical content knowledge (PCK) through performance-based tasks is both time- and effort-consuming. While large language models (LLMs) offer new opportunities for efficient automatic scoring, little is known about whether LLMs introduce construct-irrelevant variance (CIV) in ways similar to or different from traditional machine learning (ML) and human raters. This study examines three sources of CIV -- scenario variability, rater severity, and rater sensitivity to scenario -- in the context of video-based constructed-response tasks targeting two PCK sub-constructs: analyzing student thinking and evaluating teacher responsiveness. Using generalized linear mixed models (GLMMs), we compared variance components and rater-level scoring patterns across three scoring sources: human raters, supervised ML, and LLM. Results indicate that scenario-level variance was minimal across tasks, while rater-related factors contributed substantially to CIV, especially in the more interpretive Task II. The ML model was the most severe and least sensitive rater, whereas the LLM was the most lenient. These findings suggest that the LLM contributes to scoring efficiency while also introducing CIV much as human raters do, though to a different degree than supervised ML. Implications for rater training, automated scoring design, and future research on model interpretability are discussed.
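As a rough illustration of the decomposition described in the abstract (a minimal sketch; the notation, link function, and random-effect structure are assumptions for exposition, not taken from the paper), a GLMM with crossed random effects for scenario and rater could be written as:

$$
g\big(\mathbb{E}[Y_{psr}]\big) = \beta_0 + \theta_p + u_s + v_r + w_{sr}, \qquad
u_s \sim \mathcal{N}(0,\sigma^2_{\text{scenario}}),\;
v_r \sim \mathcal{N}(0,\sigma^2_{\text{severity}}),\;
w_{sr} \sim \mathcal{N}(0,\sigma^2_{\text{sensitivity}})
$$

where $Y_{psr}$ is the score rater $r$ assigns to teacher $p$'s response to scenario $s$, $g$ is the link function, $\theta_p$ is the construct-relevant teacher effect, $u_s$ captures scenario variability, $v_r$ rater severity/leniency, and $w_{sr}$ rater sensitivity to scenario. Under this kind of model, the relative magnitudes of the estimated variance components indicate how much CIV each source contributes for each scoring source (human, supervised ML, or LLM).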

Yaxuan Yang, Shiyu Wang, Xiaoming Zhai

Subjects: Educational computing technology; Computer technology

Yaxuan Yang, Shiyu Wang, Xiaoming Zhai. Using Large Language Models to Assess Teachers' Pedagogical Content Knowledge [EB/OL]. (2025-05-25) [2025-06-12]. https://arxiv.org/abs/2505.19266.
