GastroGPT: Successful proof-of-concept study of gastroenterology-specific large language model

Presented by

Dr Cem Şimşek, Hacettepe University, Turkey

Conference

UEGW 2023

Trial

GastroGPT

DOI
https://doi.org/10.55788/c6bd4c96

In a first blinded systematic comparison, a speciality-specific large language model (LLM) outperformed general LLMs such as ChatGPT across key clinical tasks. The next step would be to compare the performance of this model with human performance.

Dr Cem Şimşek (Hacettepe University, Turkey) and his research team designed a speciality-specific LLM to perform clinical tasks in gastroenterology [1]. The current proof-of-concept study compared this gastroenterology-specific model, named GastroGPT, against 3 general state-of-the-art LLMs: ChatGPT-4, Google Bard, and Anthropic’s Claude. An expert panel with reviewers from varying sub-specialities across Europe compared the models' performances across 7 clinical tasks for various simulated patient cases. “The clinical tasks included assessment, collecting additional history, recommending diagnostic tests and treatment, patient education, planning follow-up visitations, and referring patients to specialists,” clarified Dr Şimşek. The cases varied in complexity, rarity, and setting/urgency. The expert panel rated the accuracy, relevance, alignment with clinical guidelines, usability, interpretability, and potential clinical impact of the models’ output on a 10-point Likert scale.

The experts completed 480 evaluations. In all 10 cases that were evaluated, GastroGPT had a higher overall score across tasks than the other models (8.05 vs 4.95, 5.63, and 6.92; P<0.001 for all). GastroGPT performed significantly better than all general models with respect to ‘overall evaluation’, ‘additional history’, and ‘referrals’; scored better than ChatGPT and Google Bard in terms of ‘assessment’, ‘treatment’, and ‘patient education’; and scored better than ChatGPT regarding ‘recommended diagnostic tests’ (see Figure). Furthermore, GastroGPT was more consistent than the other models across different cases, clinical tasks, complexity levels, and rarity (Levene’s test P<0.001). Finally, the panel’s ratings yielded a Cronbach’s alpha of 0.76 for coherency.
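For readers unfamiliar with the statistic, the sketch below shows one common way to compute a Cronbach’s alpha such as the 0.76 reported above. It is an illustration only: the presentation did not describe how the panel's alpha was calculated, so treating each reviewer as an "item" is an assumption, the function name `cronbach_alpha` is hypothetical, and the scores are invented for the example and are not study data.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of scores.

    ratings: list of rows, one row per rated model output,
             one column per reviewer (assumption: reviewers act as 'items').
    """
    n_reviewers = len(ratings[0])
    if n_reviewers < 2 or len(ratings) < 2:
        raise ValueError("need at least 2 reviewers and 2 rated outputs")

    # Sample variance of each reviewer's scores across all rated outputs.
    reviewer_vars = [variance(column) for column in zip(*ratings)]

    # Sample variance of the summed score per rated output.
    total_var = variance([sum(row) for row in ratings])

    return n_reviewers / (n_reviewers - 1) * (1 - sum(reviewer_vars) / total_var)


# Invented 10-point scores from 3 hypothetical reviewers for 5 outputs.
example_ratings = [
    [8, 7, 8],
    [5, 4, 6],
    [6, 6, 7],
    [9, 8, 8],
    [4, 5, 5],
]

alpha = cronbach_alpha(example_ratings)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Values closer to 1 indicate that reviewers tend to rank the outputs in the same way; values around 0.7 or higher are conventionally read as acceptable consistency.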

Figure: Clinical task outcomes for GastroGPT and general LLMs [1]

LLM, large language model.

GastroGPT outperformed general-purpose LLMs across key clinical tasks, indicating that speciality-specific LLMs have potential in medical practice. However, this approach must be tested across specialities and compared with physicians’ real-world evaluations.

1. Şimşek C, et al. GastroGPT: first specialty-specific AI language model outperforms general models across key clinical tasks. LB16, UEG Week 2023, 14–17 October, Copenhagen, Denmark.
