
Can Large Language Models Serve as Analytical Tools? Focusing on the Reliability and Validity of GPT-Based Content Analysis
Abstract
This study explores the applicability of large language models (LLMs) to computational text analysis methods (CTAM) in communication studies. Although advances in LLM technology have increased the potential of LLMs as text analysis tools, their internal reliability and external validity remain crucial concerns, as these are inherent limitations of LLMs when used as measurement tools. Unlike traditional coding methods or rule-based text categorization systems, LLMs do not always guarantee consistent or reproducible outputs. This raises a critical question regarding whether LLMs can function as appropriate text analysis tools in academic research.
To assess the potential of LLMs for CTAM, this study evaluates their text analysis outputs in terms of internal reliability and external validity. Specifically, it examines whether LLMs produce consistent results for the same prompts across repeated analyses (internal reliability) and whether their outputs align with human coding results (external validity). Human coders performed information extraction on a large dataset of fact-checked news articles, and the same dataset was analyzed using an LLM. A multi-step validation process was conducted, assessing internal reliability through repeated analyses and external validity through comparison with the human coding results.
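The repeated-prompt reliability check described above can be sketched in a few lines. In this sketch, `query_llm` is a hypothetical stand-in for whatever model API is used, and modal-answer agreement is one simple consistency measure, not necessarily the exact metric used in this study.

```python
from collections import Counter

def internal_reliability(query_llm, prompt, n_runs=5):
    """Submit the same prompt n_runs times and report how often the
    model reproduces its most frequent (modal) answer."""
    outputs = [query_llm(prompt) for _ in range(n_runs)]
    modal_answer, modal_count = Counter(outputs).most_common(1)[0]
    return modal_answer, modal_count / n_runs

# Toy stand-in for a real model call, used only for illustration.
def fake_llm(prompt):
    return "claim type: numerical"

answer, consistency = internal_reliability(fake_llm, "Extract the claim type from: ...")
# A perfectly stable model yields consistency 1.0; real LLM outputs may vary run to run.
```

A consistency score well below 1.0 for a given prompt would signal that the prompt, or the task itself, needs refinement before the model's labels can be trusted.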
The findings indicate that text analysis using LLMs exhibits an acceptable level of internal reliability and external validity. Iterative analyses demonstrated that LLMs provide consistent analytical results, while comparisons with human coders confirmed a sufficient level of external validity. However, reliability and validity decreased significantly depending on the genre of the news article and the type of information analyzed. LLMs exhibited lower reliability and validity when processing economic news or articles relying heavily on numerical data, suggesting that the usability of LLMs in text analysis may vary depending on specific conditions.
Therefore, using LLMs as a measurement tool in research requires careful consideration of procedural frameworks and the nature of the data being analyzed. To ensure internal reliability, researchers should engineer prompts carefully and repeat measurements with the same prompts. Additionally, external validity should be reinforced through comparisons with human coding results. Furthermore, LLMs need to be tested across various conditions and contexts to determine the specific circumstances under which they perform well.
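The comparison with human coders can likewise be sketched with standard agreement statistics. The minimal sketch below computes percent agreement and Cohen's kappa from two parallel label lists; the fact-check verdict labels are invented for illustration, and published content-analysis work often reports Krippendorff's alpha instead.

```python
from collections import Counter

def agreement_metrics(human_labels, llm_labels):
    """Percent agreement and Cohen's kappa between two coders' label lists."""
    assert len(human_labels) == len(llm_labels)
    n = len(human_labels)
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n
    # Chance agreement expected from each coder's marginal label distribution.
    h_counts, m_counts = Counter(human_labels), Counter(llm_labels)
    expected = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Invented example: human coders vs. an LLM on five fact-check verdicts.
human = ["true", "false", "true", "mixed", "true"]
llm = ["true", "false", "true", "true", "true"]
pct, kappa = agreement_metrics(human, llm)
```

Kappa corrects percent agreement for chance, which matters when one verdict dominates the label distribution, as is common in fact-check corpora.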
The findings of this study, verified through these processes, are expected to lay the foundation for the utilization of LLMs as meaningful analytical tools in communication studies and to contribute to the advancement of text analysis methodologies. While LLMs have inherent limitations, these can be mitigated through the systematic procedures established in this study. The analytical procedure proposed here requires further discussion and refinement to develop into a standardized framework for future research.
Keywords:
large language models, computational text analysis method, reliability, validity, computational method