Medical Student University of Western Ontario London, Ontario, Canada
Background: Large language models (LLMs) are becoming increasingly accessible for medical decision-making and are often used by medical practitioners and patients. However, prior studies have raised concerns about the accuracy of certain LLMs in assisting with patient management. The purpose of this study was to compare the performance of publicly available LLMs in managing cardiovascular patient care and to evaluate their effectiveness relative to cardiovascular specialists.
Methods and Results: We assessed the performance of seven publicly available LLMs on validated cardiovascular antithrombotic care scenarios, which were evaluated by three independent clinicians for accuracy and reasoning. The results were compared to the performance of volunteer clinicians, based on a survey conducted at the 2023 Canadian Cardiovascular Congress. Statistical analyses assessed interobserver reliability and evaluated performance differences between models. Our findings reveal that Claude 3 Opus correctly answered 85% of clinical scenarios, significantly outperforming both the other LLMs (p < 0.001) and all clinician groups. Among the clinicians, cardiologists and senior residents achieved the highest accuracy rates (43% [CI95: 32-52%] and 47% [CI95: 39-56%], respectively), comparable to GPT-4o (55%) and Claude 3.5 Sonnet (44%). General practitioners performed similarly to Claude 3 Sonnet and Gemini 1.5 (22% [CI95: 11-33%] vs. 26% vs. 30%), while medical students achieved 8.3% [CI95: 2-15%] accuracy, closely aligning with GPT-3.5 (10%).
Conclusion: The performance of different LLMs in cardiovascular clinical scenarios varied widely, ranging from some models that outperformed clinicians to some free-tier models that provided inappropriate and potentially harmful medical advice. However, all tested models demonstrated acceptable performance in delivering patient recommendations regarding lifestyle and dietary management. Ultimately, clinicians and patients should exercise caution when using LLMs, select models appropriate for the task at hand, and cross-check provided references to ensure safe use of LLMs in practice.