An analysis of ChatGPT recommendations for the diagnosis and treatment of cervical radiculopathy.

An analysis of ChatGPT recommendations for the diagnosis and treatment of cervical radiculopathy.

Hoang, Timothy;Liou, Lathan;Rosenberg, Ashley M;Zaidat, Bashar;Duey, Akiro H;Shrestha, Nancy;Ahmed, Wasil;Tang, Justin;Kim, Jun S;Cho, Samuel K;
Journal of neurosurgery. Spine 2024 pp. 1-11
43
hoang2024anjournal

Abstract

The objective of this study was to assess the safety and accuracy of ChatGPT recommendations in comparison to the evidence-based guidelines from the North American Spine Society (NASS) for the diagnosis and treatment of cervical radiculopathy.ChatGPT was prompted with questions from the 2011 NASS clinical guidelines for cervical radiculopathy and evaluated for concordance. Selected key phrases within the NASS guidelines were identified. Completeness was measured as the number of overlapping key phrases between ChatGPT responses and NASS guidelines divided by the total number of key phrases. A senior spine surgeon evaluated the ChatGPT responses for safety and accuracy. ChatGPT responses were further evaluated on their readability, similarity, and consistency. Flesch Reading Ease scores and Flesch-Kincaid reading levels were measured to assess readability. The Jaccard Similarity Index was used to assess agreement between ChatGPT responses and NASS clinical guidelines.A total of 100 key phrases were identified across 14 NASS clinical guidelines. The mean completeness of ChatGPT-4 was 46%. ChatGPT-3.5 yielded a completeness of 34%. ChatGPT-4 outperformed ChatGPT-3.5 by a margin of 12%. ChatGPT-4.0 outputs had a mean Flesch reading score of 15.24, which is very difficult to read, requiring a college graduate education to understand. ChatGPT-3.5 outputs had a lower mean Flesch reading score of 8.73, indicating that they are even more difficult to read and require a professional education level to do so. However, both versions of ChatGPT were more accessible than NASS guidelines, which had a mean Flesch reading score of 4.58. Furthermore, with NASS guidelines as a reference, ChatGPT-3.5 registered a mean ± SD Jaccard Similarity Index score of 0.20 ± 0.078 while ChatGPT-4 had a mean of 0.18 ± 0.068. Based on physician evaluation, outputs from ChatGPT-3.5 and ChatGPT-4.0 were safe 100% of the time. Thirteen of 14 (92.8%) ChatGPT-3.5 responses and 14 of 14 (100%) ChatGPT-4.0 responses were in agreement with current best clinical practices for cervical radiculopathy according to a senior spine surgeon.ChatGPT models were able to provide safe and accurate but incomplete responses to NASS clinical guideline questions about cervical radiculopathy. Although the authors' results suggest that improvements are required before ChatGPT can be reliably deployed in a clinical setting, future versions of the LLM hold promise as an updated reference for guidelines on cervical radiculopathy. Future versions must prioritize accessibility and comprehensibility for a diverse audience.

Citation

ID: 279328
Ref Key: hoang2024anjournal
Use this key to autocite in SciMatic or Thesis Manager

References

Blockchain Verification

Account:
NFT Contract Address:
0x95644003c57E6F55A65596E3D9Eac6813e3566dA
Article ID:
279328
Unique Identifier:
10.3171/2024.4.SPINE231148
Network:
Scimatic Chain (ID: 481)
Loading...
Blockchain Readiness Checklist
Authors
Abstract
Journal Name
Year
Title
5/5
Creates 1,000,000 NFT tokens for this article
Token Features:
  • ERC-1155 Standard NFT
  • 1 Million Supply per Article
  • Transferable via MetaMask
  • Permanent Blockchain Record
Blockchain QR Code
Scan with Saymatik Web3.0 Wallet

Saymatik Web3.0 Wallet