Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination

Numan Mercan; Ahmet Yurteri; Ebubekir Eravşar; Ahmet Yıldırım

doi:10.5455/handmicrosurg.249704

2025, Vol: 14, Issue: 3

Original Article

Hand Microsurg. 2025; 14(3): 97-105

Abstract
Objectives: This study examined the accuracy of three large language models (LLMs), ChatGPT-4o, DeepSeek-R1, and Gemini 2.0, on the multiple-choice questions (MCQs) of the European Board of Hand Surgery (EBHS) exam, together with the reasons for their wrong answers. It was hypothesized that DeepSeek-R1 would show a higher accuracy rate than the other two models, based on reported differences in training datasets.
Materials and Methods: Ten exams published in The Journal of Hand Surgery (European Volume) between 2022 and 2024, with 150 true/false MCQs in total, were examined. The MCQs were divided into five categories according to their content: anatomy, trauma, systemic-chronic diseases, microsurgery, and congenital disorders. The reasons for the models' wrong answers were classified into four groups: data-related, semantic, algorithmic, and logical errors.
Results: The correct answer rate was 74% for ChatGPT-4o, 76.7% for DeepSeek-R1, and 73.3% for Gemini 2.0, with no significant difference between these rates (p = 0.572). The three models gave the same answer for 103 of the 150 MCQs, and 84.5% of these shared answers were correct. Among the wrong answers, the most frequent type of error was data-related.
Conclusion: There was no significant difference in accuracy rates, content-based subcategories, or error types among the three LLMs. Data-related errors indicate gaps in training, but approximately 75% accuracy in this exam suggests that further error analysis could enhance future model performance.
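
The percentages reported in the Results can be turned back into whole-question counts as a quick consistency check. The sketch below is illustrative only and is not part of the authors' analysis: the helper names `implied_count` and `percent_of` are hypothetical, and the only inputs taken from the abstract are the 150 MCQs, the three accuracy rates, and the 103 shared answers with their 84.5% correct rate. It does not attempt to reproduce the reported p-value, because the abstract does not name the statistical test or give per-question data.

```python
# Figures reported in the abstract (the only values taken from the source).
TOTAL_MCQS = 150
REPORTED_ACCURACY = {
    "ChatGPT-4o": 74.0,
    "DeepSeek-R1": 76.7,
    "Gemini 2.0": 73.3,
}
AGREED_MCQS = 103          # questions on which all three models gave the same answer
AGREED_CORRECT_PCT = 84.5  # share of those shared answers that were correct


def implied_count(percent: float, total: int) -> int:
    """Return the whole-number count closest to `percent` percent of `total`."""
    return round(percent * total / 100)


def percent_of(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Whole-question counts implied by the reported percentages.
correct_counts = {
    model: implied_count(pct, TOTAL_MCQS)
    for model, pct in REPORTED_ACCURACY.items()
}
agreed_correct = implied_count(AGREED_CORRECT_PCT, AGREED_MCQS)

for model, count in correct_counts.items():
    # Round-tripping the count should give back the reported percentage.
    print(f"{model}: {count}/{TOTAL_MCQS} correct ({percent_of(count, TOTAL_MCQS)}%)")
print(f"Shared answers: {agreed_correct}/{AGREED_MCQS} correct "
      f"({percent_of(agreed_correct, AGREED_MCQS)}%)")
```

Under these assumptions the rates correspond to 111, 115, and 110 correct answers out of 150, and to 87 correct answers among the 103 shared ones, so the gap between the best and worst model is five questions.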

Key words: artificial intelligence; board exam; ChatGPT; DeepSeek; error analysis; Gemini; hand surgery; large language models

How to Cite this Article

Pubmed Style

Mercan N, Yurteri A, Eravşar E, Yıldırım A. Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination. Hand Microsurg. 2025; 14(3): 97-105. doi:10.5455/handmicrosurg.249704

Web Style

Mercan N, Yurteri A, Eravşar E, Yıldırım A. Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination. https://handmicrosurgeryjournal.com/?mno=249704 [Access: January 28, 2026]. doi:10.5455/handmicrosurg.249704

AMA (American Medical Association) Style

Mercan N, Yurteri A, Eravşar E, Yıldırım A. Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination. Hand Microsurg. 2025; 14(3): 97-105. doi:10.5455/handmicrosurg.249704

Vancouver/ICMJE Style

Mercan N, Yurteri A, Eravşar E, Yıldırım A. Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination. Hand Microsurg. (2025), [cited January 28, 2026]; 14(3): 97-105. doi:10.5455/handmicrosurg.249704

Harvard Style

Mercan, N., Yurteri, A., Eravşar, E. & Yıldırım, A. (2025) Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination. Hand Microsurg, 14 (3), 97-105. doi:10.5455/handmicrosurg.249704

Turabian Style

Mercan, Numan, Ahmet Yurteri, Ebubekir Eravşar, and Ahmet Yıldırım. 2025. Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination. Hand and Microsurgery, 14 (3), 97-105. doi:10.5455/handmicrosurg.249704

Chicago Style

Mercan, Numan, Ahmet Yurteri, Ebubekir Eravşar, and Ahmet Yıldırım. "Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination." Hand and Microsurgery 14 (2025), 97-105. doi:10.5455/handmicrosurg.249704

APA (American Psychological Association) Style

Mercan, N., Yurteri, A., Eravşar, E. & Yıldırım, A. (2025) Accuracy and Error Analysis of ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 in the European Board of Hand Surgery Examination. Hand and Microsurgery, 14 (3), 97-105. doi:10.5455/handmicrosurg.249704

About Hand and Microsurgery: Hand and Microsurgery, the official journal of Turkish Society for Surgery of Hand and Upper Extremity, is a peer-reviewed scientific journal.