Classroom Dynamics Research

Multimodal Speaker Identification in Classroom Environments

Automated analysis of K-12 classroom dynamics is challenging because background noise and variable child speech often confound acoustic-only models. This study evaluates a multimodal speaker identification framework that anchors acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found that an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based "contextual anchoring" into a gradient boosting classifier, our multimodal approach raised student identification accuracy to 50.3%. Performance also improved for utterances longer than 5 seconds, reaching 76.9% accuracy (vs. 64.9% for the baseline) with 90.9% Top-3 accuracy. Additionally, the model distinguished teacher from student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems that can account for individual student participation, a crucial step toward supporting equitable instruction at scale.
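For readers unfamiliar with the Top-3 metric reported above, the following is a minimal sketch of how a top-k accuracy score can be computed from per-utterance speaker scores. It is an illustration only, not the authors' evaluation code: the function name `top_k_accuracy`, the toy scores, and the toy labels are all hypothetical and do not come from the EDSI dataset.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 3,
) -> float:
    """Fraction of utterances whose true speaker is among the k top-scored candidates.

    scores[i][j] is a (hypothetical) classifier score for candidate speaker j
    on utterance i; labels[i] is the index of the true speaker.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one utterance")

    hits = 0
    for row, true_idx in zip(scores, labels):
        # Rank candidate speaker indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up scores for 4 utterances and 5 candidate speakers (illustrative only).
toy_scores = [
    [0.50, 0.20, 0.15, 0.10, 0.05],  # true speaker 0 is ranked 1st
    [0.30, 0.35, 0.20, 0.10, 0.05],  # true speaker 2 is ranked 3rd
    [0.05, 0.10, 0.15, 0.30, 0.40],  # true speaker 0 is ranked 5th
    [0.25, 0.40, 0.20, 0.10, 0.05],  # true speaker 0 is ranked 2nd
]
toy_labels = [0, 2, 0, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
print(f"Top-1: {top1:.2f}, Top-3: {top3:.2f}")
```

On this toy input, Top-1 accuracy counts only the first utterance as correct, while Top-3 also credits the second and fourth, which is why a Top-3 figure is always at least as high as the corresponding Top-1 figure.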

Preprint available here.

Citation: Chrzan, M. L., Krishnaswamy, M., Gibboni, R., Wetstone, K., Ai, W., & Liu, J. (2026). Multimodal speaker identification in classroom environments. arXiv. https://doi.org/10.48550/arXiv.2606.13712

Also presented at:

Stanford Educational Data Science (EDS) Conference (May 2026), Stanford, CA.

Michael Leon Chrzan