400 Section - A comparative analysis of Large Language Models’ performances on concussion topics

Wednesday, May 20, 2026

6:00 PM - 7:30 PM PT

Presenting Author(s)

Tiffany Got, MD

Medical Resident
University of Toronto
Toronto, Ontario, Canada

Co-Author(s)

Abirami Kirubarajan, MD

Medical Resident
McMaster University
Toronto, Ontario, Canada

Vivian Weixuan Zhang, MASc

Medical Student
The University of British Columbia
Vancouver, British Columbia, Canada

Nicolas Mathieu Gauthier, BS

Medical Student
University of Ottawa
Ottawa, Ontario, Canada

Joy Chowdhury, BHSc

Student
Lawson Research Institute at St. Joseph’s Health Care London
London, Ontario, Canada

Suchnoor Dhillon

Medical Student
McMaster University
St. Catharines, Ontario, Canada

Objectives:

The purpose of this study was to compare the quality of responses from three large language models (LLMs) on concussion management against the Living Concussion guidelines.

Design:

Three LLMs (ChatGPT, Perplexity, Open Evidence) were assessed in this cross-sectional performance analysis. Patient questions were generated based on the Living Concussion guidelines in four domains: Diagnosis, Prognosis, Treatment - Acute, and Treatment - Chronic. Responses were assessed in five domains via the S.C.O.R.E. framework (Safety, Consensus with Guidelines, Objectivity, Reproducibility, and Explainability). The validated Patient Education Materials Assessment Tool (PEMAT-p) was used to assess understandability. Each response was graded independently by four reviewers on a 5-point Likert scale.
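The mean ± standard deviation summaries reported for each domain can be illustrated with a minimal Python sketch. The ratings below are invented for illustration and are not study data. The function names are hypothetical. The abstract does not state whether a sample or population standard deviation was used, so the sample form is an assumption here.

```python
from statistics import mean, stdev

# Hypothetical 5-point Likert ratings for one response, one score per
# reviewer. These numbers are made up; they are NOT the study's data.
example_ratings = {
    "Safety": [5, 4, 5, 4],
    "Explainability": [1, 2, 1, 4],
}


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of Likert scores."""
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"score {s} is outside the 1-5 Likert range")
    # Assumption: sample SD (n - 1 denominator); the abstract does not say.
    return mean(scores), stdev(scores)


def format_summary(scores):
    """Format a summary in the 'mean ± SD' style used in the Results."""
    m, sd = summarize(scores)
    return f"{m:.2f} ± {sd:.2f}"


if __name__ == "__main__":
    for domain, scores in example_ratings.items():
        print(f"{domain}: {format_summary(scores)}")
```

Swapping `stdev` for `statistics.pstdev` would give the population form instead; which one applies depends on the authors' analysis plan.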

Results:

A total of 132 responses were evaluated in the 5 domains of the S.C.O.R.E. framework. All 3 LLMs performed best in the domain of reproducibility (4.75 ± 0.43, 4.75 ± 0.43, and 5.00 ± 0.00 for ChatGPT, Perplexity, and Open Evidence, respectively) and poorest in the domain of explainability (1.71 ± 1.29, 1.44 ± 0.87, and 3.16 ± 1.59 for ChatGPT, Perplexity, and Open Evidence, respectively). The mean scores for consensus with guidelines and safety were not statistically different between LLMs (p = 0.128 and p = 0.70, respectively). However, there was a significant difference in explainability between the LLMs (p < 0.0001). Few responses (3/37) showed moderate-to-major deviation from guidelines, and these discrepancies were rated as “unlikely” to pose severe harm to the patient. All 3 LLMs scored moderately on the PEMAT-p, with scores ranging from 5/12 to 11/12.

Conclusion:

LLMs demonstrated the ability to generate generally evidence-based and safe information for concussion rehabilitation. However, all 3 LLMs scored lowest in explainability, underscoring the limited transparency of these models. As such, rehabilitation clinicians should be cognizant of the strengths and limitations of LLMs as an adjunct for patient education and clinical decision support.