Classroom Dynamics Research
Conference Presentation (upcoming): Multimodal Speaker Identification in Classroom Environments
Automated analysis of K-12 classroom dynamics is challenging because background noise and variable child speech often confound acoustic-only models. This study evaluates a multimodal speaker identification framework that anchors acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found that an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based “contextual anchoring” into a gradient boosting classifier, our multimodal approach raised student identification accuracy to 50.3%. For utterances longer than 5 seconds, the multimodal model reached 76.9% accuracy (vs. 64.9% for the baseline), with 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems that can account for individual student participation, a crucial step toward supporting equitable instruction at scale.
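For readers unfamiliar with the Top-3 metric reported above, the following is a minimal Python sketch of how Top-k accuracy is commonly computed from per-speaker classifier scores. It is not the study's evaluation code: the function name `top_k_accuracy`, the toy scores, and the toy labels are all hypothetical and chosen only for illustration.

```python
def top_k_accuracy(scores, labels, k=3):
    """Fraction of utterances whose true speaker is among the k
    highest-scoring candidates.

    scores: one list of per-speaker scores for each utterance
    labels: index of the true speaker for each utterance
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one utterance is required")

    hits = 0
    for row, true_idx in zip(scores, labels):
        # Rank candidate speakers from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical toy data: 4 utterances, 4 candidate speakers (not study data).
toy_scores = [
    [0.60, 0.30, 0.10, 0.00],  # true speaker 0 is ranked first
    [0.10, 0.20, 0.30, 0.40],  # true speaker 1 is ranked third
    [0.40, 0.30, 0.20, 0.10],  # true speaker 3 is ranked fourth
    [0.25, 0.05, 0.50, 0.20],  # true speaker 2 is ranked first
]
toy_labels = [0, 1, 3, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
print(f"Top-1: {top1:.2f}, Top-3: {top3:.2f}")
```

On this toy data the second utterance counts as a miss under Top-1 but as a hit under Top-3, which is why a Top-3 figure is always at least as high as the corresponding Top-1 accuracy.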
Preprint available soon.
Citation: Chrzan, M. L., Krishnaswamy, M., Gibboni, R., Wetstone, K., Tabatabaee, S., Ai, W., & Liu, J. (2026, May). Multimodal Speaker Identification in Classroom Environments. Stanford Educational Data Science (EDS) Conference, Stanford, CA. Accepted for presentation.