The performance of large language model powered chatbots compared to oncology physicians on colorectal cancer queries.

Zhou, Shan; Luo, Xiao; Chen, Chan; Jiang, Hong; Yang, Chun; Ran, Guanghui; Yu, Juan; Yin, Chengliang

International journal of surgery (London, England) 2024

Abstract

Large language model (LLM)-powered chatbots have become increasingly prevalent in healthcare, but their capability in oncology remains largely unknown. This study evaluated the performance of LLM-powered chatbots compared with oncology physicians in addressing colorectal cancer queries.

The study was conducted between August 13, 2023, and January 5, 2024. A total of 150 questions were designed, and each question was submitted three times to eight chatbots: ChatGPT-3.5, ChatGPT-4, ChatGPT-4 Turbo, Doctor GPT, Llama-2-70B, Mixtral-8x7B, Bard, and Claude 2.1. No feedback was provided to the chatbots. The questions were also answered by nine oncology physicians: three residents, three fellows, and three attendings. Each answer was scored on its consistency with guidelines, with 1 for a consistent answer and 0 for an inconsistent answer. The total score for each question was the number of correct answers, ranging from 0 to 3. The accuracy and scores of the chatbots were compared with those of the physicians.

Claude 2.1 demonstrated the highest average accuracy at 82.67%, followed by Doctor GPT at 80.45%, ChatGPT-4 Turbo at 78.44%, ChatGPT-4 at 78%, Mixtral-8x7B at 73.33%, Bard at 70%, ChatGPT-3.5 at 64.89%, and Llama-2-70B at 61.78%. In terms of accuracy, Claude 2.1 outperformed residents, fellows, and attendings; Doctor GPT outperformed residents and fellows; and Mixtral-8x7B outperformed residents. In terms of scores, Claude 2.1 outperformed residents and fellows, while Doctor GPT, ChatGPT-4 Turbo, and ChatGPT-4 outperformed residents.

This study shows that LLM-powered chatbots can provide more accurate medical information than oncology physicians.
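
The scoring scheme described in the abstract (three submissions per question, each answer rated 1 or 0, per-question totals from 0 to 3) can be illustrated with a minimal sketch. The function names and the toy ratings below are hypothetical, and the accuracy definition (share of guideline-consistent answers over all submissions) is an assumption, since the abstract does not spell out the exact formula.

```python
from typing import Sequence


def question_score(ratings: Sequence[int]) -> int:
    """Total score for one question: number of consistent answers (0 to 3)."""
    if len(ratings) != 3 or any(r not in (0, 1) for r in ratings):
        raise ValueError("expected exactly three ratings, each 0 or 1")
    return sum(ratings)


def accuracy(all_ratings: Sequence[Sequence[int]]) -> float:
    """Assumed definition: consistent answers divided by total submissions."""
    if not all_ratings:
        raise ValueError("no questions provided")
    scores = [question_score(r) for r in all_ratings]
    return sum(scores) / (3 * len(scores))


# Hypothetical ratings for four questions (not data from the study).
toy_ratings = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
scores = [question_score(r) for r in toy_ratings]
toy_accuracy = accuracy(toy_ratings)
print(scores, f"{toy_accuracy:.2%}")
```

Under this assumed definition, a respondent's accuracy over the study's 150 questions would be the number of consistent answers out of 450 submissions.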

Keywords

artificial intelligence ai chatbot latarjet shoulder instability generative llm large language model

Access

DOI: 10.1097/JS9.0000000000001850

Citation

ID: 279332

Ref Key: zhou2024theinternational
