Clinical Sciences/Health Conditions
Tiffany Got, MD
Medical Resident
University of Toronto
Toronto, Ontario, Canada
Abirami Kirubarajan, MD
Medical Resident
McMaster University
Toronto, Ontario, Canada
Vivian Weixuan Zhang, MASc
Medical Student
The University of British Columbia
Vancouver, British Columbia, Canada
Nicolas Mathieu Gauthier, BS
Medical Student
University of Ottawa
Ottawa, Ontario, Canada
Joy Chowdhury, BHSc
Student
Lawson Research Institute at St. Joseph’s Health Care London
London, Ontario, Canada
Suchnoor Dhillon
Medical Student
McMaster University
St. Catharines, Ontario, Canada
Three large language models (LLMs; ChatGPT, Perplexity, and Open Evidence) were assessed in this cross-sectional performance analysis. Patient questions were generated from the Living Concussion guidelines in four domains: Diagnosis, Prognosis, Treatment - Acute, and Treatment - Chronic. Responses were assessed in five domains using the S.C.O.R.E. framework (Safety, Consensus with Guidelines, Objectivity, Reproducibility, and Explainability). The validated Patient Education Materials Assessment Tool (PEMAT-p) was used to assess understandability. Each response was graded independently by four reviewers on a 5-point Likert scale.
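As an illustration of how reviewer ratings of this kind can be summarized as a mean ± standard deviation, the following is a minimal Python sketch. The ratings are hypothetical and are not study data, the helper name `summarize` is invented for this example, and the abstract does not state whether a population or sample standard deviation was used, so both are shown.

```python
from statistics import mean, pstdev, stdev


def summarize(ratings):
    """Return (mean, population SD, sample SD) for 5-point Likert ratings."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("Likert ratings must lie between 1 and 5")
    return mean(ratings), pstdev(ratings), stdev(ratings)


# Hypothetical ratings from four reviewers for one response (not study data).
example = [5, 5, 5, 4]
m, pop_sd, samp_sd = summarize(example)
print(f"{m:.2f} ± {pop_sd:.2f} (population SD), ± {samp_sd:.2f} (sample SD)")
```

The two standard deviations differ noticeably when only a few ratings are pooled, so it helps to state which one is being reported.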
Results:
A total of 132 responses were evaluated in the five domains of the S.C.O.R.E. framework. All three LLMs performed best in the domain of reproducibility (4.75 ± 0.43, 4.75 ± 0.43, and 5.00 ± 0.00 for ChatGPT, Perplexity, and Open Evidence, respectively) and poorest in the domain of explainability (1.71 ± 1.29, 1.44 ± 0.87, and 3.16 ± 1.59, respectively). Mean scores for consensus with guidelines and safety did not differ significantly between LLMs (p = 0.128 and p = 0.70, respectively). However, explainability differed significantly between the LLMs (p < 0.0001). Few responses (3/37) showed moderate-to-major deviation from guidelines, and these discrepancies were rated as “unlikely” to pose severe harm to the patient. All three LLMs scored moderately on the PEMAT-p, with scores ranging from 5/12 to 11/12.
Conclusion:
LLMs demonstrate the ability to generate generally evidence-based and safe information for concussion rehabilitation. However, all three LLMs scored lowest in explainability, underscoring the limited transparency of these models. Rehabilitation clinicians should therefore be cognizant of the strengths and limitations of LLMs as an adjunct for patient education and clinical decision support.