The Role of Generative AI in Advancing Educational Technology Research: A Systematic Review of Qualitative Data Analysis
DOI: https://doi.org/10.17977/um039v11i12026p108-127
Keywords: Large Language Models, Qualitative Data Analysis, Thematic Coding, Inter-rater Reliability, Human-AI Collaboration
Abstract
Large language models (LLMs) are increasingly used for qualitative data analysis; however, questions remain regarding their reliability compared to human coders. Following PRISMA 2020 guidelines, this systematic review synthesizes empirical evidence on the use of generative artificial intelligence for coding interview and focus group data. Of the 1,085 records retrieved from six academic databases between 2020 and 2026, 30 studies met the inclusion criteria. The findings indicate that LLMs, predominantly GPT-4, achieve moderate to substantial thematic agreement with human coders, with Cohen’s kappa values ranging from 0.40 to 0.91 (median 0.72) and accuracy rates between 77% and 96%. Reliability significantly improves with optimized prompting strategies and multi-run ensemble methods. Although LLMs demonstrate exceptional efficiency, reducing analysis time by 80% to 95%, they still face limitations in capturing cultural nuance, interpretive depth, and context-dependent coding. Therefore, current evidence supports the use of LLMs as an augmentation tool rather than a replacement for human researchers. Hybrid human-AI workflows, combining computational efficiency with human interpretive rigor, represent the most promising approach for robust qualitative analysis. For educational researchers, these findings highlight the potential of LLMs to advance qualitative learning analytics by enabling rapid processing of large-scale student data. Ultimately, this hybrid approach allows for deeper insights into technology-enhanced learning environments without sacrificing pedagogical nuance.
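As an illustration of the agreement statistic reported above, the following is a minimal sketch of how Cohen's kappa between a human coder and an LLM could be computed, and how several LLM runs might be combined by majority vote before comparison. It is not taken from any of the reviewed studies. The function names, the theme labels, and the toy data are all hypothetical, and majority voting is only one possible reading of "multi-run ensemble methods".

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical codes."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of segments")
    n = len(rater_a)
    # Observed agreement: share of segments given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over codes.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical code throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def majority_vote(runs):
    """Combine several coding runs segment by segment.

    Ties are resolved in favour of the code seen first across the runs,
    which is an arbitrary choice made for this sketch.
    """
    return [Counter(codes).most_common(1)[0][0] for codes in zip(*runs)]


# Hypothetical data: eight interview segments, three themes (A, B, C).
human = ["A", "B", "A", "C", "B", "A", "C", "B"]
llm_runs = [
    ["A", "B", "A", "C", "A", "A", "C", "B"],
    ["A", "B", "B", "C", "B", "A", "C", "B"],
    ["A", "C", "A", "C", "B", "A", "B", "B"],
]

single_run_kappa = cohens_kappa(human, llm_runs[0])
ensemble_kappa = cohens_kappa(human, majority_vote(llm_runs))
print(f"single run: {single_run_kappa:.2f}, ensemble: {ensemble_kappa:.2f}")
```

In this toy data each run disagrees with the human coder on a different segment, so the majority vote cancels the individual errors. Real transcripts will not behave this neatly, and correlated errors across runs would not be corrected by voting.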
