The Role of Generative AI in Advancing Educational Technology Research: A Systematic Review of Qualitative Data Analysis

Dedi Aco; Ming-Chou Liu; Harmita Sari

doi:10.17977/um039v11i12026p108-127

Authors

Dedi Aco, National Dong Hwa University
Ming-Chou Liu, National Dong Hwa University
Harmita Sari, Universitas Muhammadiyah Palopo

Keywords:

Large Language Models, Qualitative Data Analysis, Thematic Coding, Inter-rater Reliability, Human-AI Collaboration

Abstract

Large language models (LLMs) are increasingly used for qualitative data analysis; however, questions remain regarding their reliability compared to human coders. Following PRISMA 2020 guidelines, this systematic review synthesizes empirical evidence on the use of generative artificial intelligence for coding interview and focus group data. Of the 1,085 records retrieved from six academic databases between 2020 and 2026, 30 studies met the inclusion criteria. The findings indicate that LLMs, predominantly GPT-4, achieve moderate to substantial thematic agreement with human coders, with Cohen’s kappa values ranging from 0.40 to 0.91 (median 0.72) and accuracy rates between 77% and 96%. Reliability significantly improves with optimized prompting strategies and multi-run ensemble methods. Although LLMs demonstrate exceptional efficiency, reducing analysis time by 80% to 95%, they still face limitations in capturing cultural nuance, interpretive depth, and context-dependent coding. Therefore, current evidence supports the use of LLMs as an augmentation tool rather than a replacement for human researchers. Hybrid human-AI workflows, combining computational efficiency with human interpretive rigor, represent the most promising approach for robust qualitative analysis. For educational researchers, these findings highlight the potential of LLMs to advance qualitative learning analytics by enabling rapid processing of large-scale student data. Ultimately, this hybrid approach allows for deeper insights into technology-enhanced learning environments without sacrificing pedagogical nuance.
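To make the agreement statistic reported above concrete, the following is a minimal sketch of how Cohen's kappa between a human coder and an LLM could be computed, and how several LLM runs might be combined by majority vote before comparison. It is an illustration under stated assumptions, not the procedure used by the reviewed studies: the code labels, the toy data, and the function names `cohens_kappa` and `majority_vote` are all hypothetical, and majority voting is only one possible reading of "multi-run ensemble methods".

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length lists of categorical codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Observed agreement: share of segments where both coders chose the same code.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions, summed over codes.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders applied one identical code to every segment
    return (p_o - p_e) / (1 - p_e)


def majority_vote(runs):
    """Combine several coding runs into one label per segment.

    Ties are broken in favour of the label seen first across the runs.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*runs)]


# Hypothetical toy data: ten interview segments, three invented codes.
human = ["access", "access", "support", "support", "workload",
         "workload", "access", "support", "workload", "access"]
run1 = ["access", "access", "support", "workload", "workload",
        "workload", "access", "support", "workload", "support"]
run2 = ["access", "support", "support", "support", "workload",
        "workload", "access", "support", "access", "access"]
run3 = ["access", "access", "support", "support", "workload",
        "access", "access", "access", "workload", "access"]

ensemble = majority_vote([run1, run2, run3])

if __name__ == "__main__":
    print("single run vs human:", round(cohens_kappa(human, run1), 3))
    print("ensemble vs human:  ", round(cohens_kappa(human, ensemble), 3))
```

In this invented example the majority vote happens to cancel the errors of the individual runs. That is a property of the toy data, chosen to show the mechanism, and says nothing about the effect sizes found in the reviewed studies.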

