Medical triage as an AI ethics benchmark - Nature

In this work, we demonstrated the ability of LLMs to solve ethical dilemmas in the medical context. All models, except Mistral, consistently outperformed random guessing on the TRIAGE benchmark. This indicates that models do indeed have a good understanding of moral values, as suggested by1, and that they are able to make sound moral decisions in the medical context.

The TRIAGE benchmark is based on real-world, high-stakes scenarios of ethical decision making, complementing existing ME benchmarks such as1 and4 that primarily rely on highly fictional scenarios created by researchers. By identifying significant differences between models, we demonstrate that TRIAGE is a viable alternative to traditional annotation-based methods for designing ME benchmarks. In addition to featuring real-world decision-making scenarios, a key advantage of TRIAGE is its focus on assessing explicit ethics. The benchmark requires models to explicitly choose an action in each scenario, which is crucial because a model may possess implicit knowledge of human values but still prioritize other values in its actions2.

Given the safety focus of our ME benchmarks, worst-case performance may be more critical than best-case performance. To capture a broader range of potential model behaviors, we included multiple syntax variations, jailbreaking attacks, and ethical contexts. All models, except Mistral, consistently outperformed random guessing even in their worst-performing condition. However, our findings show that the relative ranking of models varied between best- and worst-case performances. The best-case rankings (see Fig. 2a and 2b) align with expectations based on MT-Bench ratings9. Interestingly, Claude 3 Haiku, which scored lower on MT-Bench than GPT-4, outperformed it in some ethical dilemma scenarios. One possible explanation is that more capable models like GPT-4 may experience “competing objectives”11, where their enhanced instruction-following abilities conflict with safety training. However, Claude 3 Opus, considered as capable as GPT-4, did not show the same performance drop, suggesting that model architecture and training practices may be more predictive of ethical decision-making than general capability.
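The best- versus worst-case aggregation described above can be sketched as follows. This is a minimal illustration; the model names, condition labels, and accuracy values are placeholders, not our actual results:

```python
from collections import defaultdict

# Placeholder (model, condition) -> accuracy values; purely illustrative.
scores = {
    ("model-a", "neutral"): 0.82, ("model-a", "jailbreak"): 0.61,
    ("model-b", "neutral"): 0.78, ("model-b", "jailbreak"): 0.70,
}

def best_worst(scores):
    """Collapse per-condition accuracies into (best, worst) per model."""
    by_model = defaultdict(list)
    for (model, _condition), acc in scores.items():
        by_model[model].append(acc)
    return {m: (max(accs), min(accs)) for m, accs in by_model.items()}

summary = best_worst(scores)

# Rank by worst-case accuracy, the safety-oriented criterion.
worst_case_ranking = sorted(summary, key=lambda m: summary[m][1], reverse=True)
```

With these placeholder numbers, model-a wins on best-case accuracy but model-b wins on worst-case accuracy, mirroring the rank reversals we observed between the two criteria.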

Our findings support three key hypotheses from2: (1) trustworthiness and utility (i.e., functional effectiveness) are often positively correlated, (2) proprietary LLMs tend to outperform open source LLMs on ME benchmarks, and (3) proprietary LLMs are often overly calibrated toward beneficence. To explore this further, we analyzed error distributions per model. We found that proprietary LLMs primarily made overcaring errors, while open-source LLMs mostly made undercaring errors. Undercaring errors involve actively neglecting a patient in need, which is arguably more grave than committing an overcaring error, in which a patient receives too many resources. However, as2 note, while proprietary models may perform better, the increased transparency of open-source models offers an important trade-off to consider.
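The over-/undercaring distinction can be operationalized by comparing the urgency of the care a model assigns with the gold-standard label. A minimal sketch, in which the integer urgency encoding and the example data are assumptions for illustration:

```python
from collections import Counter

def error_type(assigned: int, gold: int) -> str:
    """Classify a triage decision; urgency is encoded as an integer,
    higher = more care/resources assigned (illustrative encoding)."""
    if assigned == gold:
        return "correct"
    # Assigning more care than warranted is overcaring; less is undercaring.
    return "overcaring" if assigned > gold else "undercaring"

# Tally error types over a batch of (assigned, gold) pairs (made-up data).
predictions = [(3, 2), (1, 2), (2, 2), (3, 1)]
tally = Counter(error_type(a, g) for a, g in predictions)
```

Comparing such tallies per model is how the proprietary-versus-open-source error asymmetry described above can be made visible.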

In our tests, neutral question formulations led to the best model performance. Most ethics prompts, which “remind” models of a specific moral context, had no effect or worsened performance. This suggests that emphasizing ethical implications can impair decision making. While ethics prompts can be effective in some cases4, focusing on actions and their consequences often reduces performance. Therefore, when using LLMs to assist with ethical decisions in the medical context, it may be best to use “factual” prompts to encourage rational decision-making.
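The contrast between a neutral (“factual”) formulation and an ethics-reminder formulation can be illustrated with hypothetical prompt templates; the wording below is our own illustration, not the benchmark’s actual prompts:

```python
# Illustrative scenario text, not taken from the benchmark.
SCENARIO = ("A patient is not breathing after airway repositioning; "
            "three other patients are waiting for assessment.")

# Neutral, fact-focused framing: performed best in our tests.
factual_prompt = (f"{SCENARIO}\n"
                  "Which triage category applies? Answer with a single option.")

# Ethics-reminder framing: often had no effect or worsened performance.
ethics_prompt = ("Remember that this is a moral dilemma with serious "
                 "ethical implications for everyone involved.\n"
                 f"{SCENARIO}\n"
                 "Which triage category applies? Answer with a single option.")
```

The two variants differ only in the prepended moral framing, isolating the framing effect from the underlying clinical question.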

Complexity of Scenarios

We acknowledge that our benchmark represents a significant simplification of actual emergency medical decision-making, which constitutes the most substantial limitation of our work. Real mass casualty incidents involve dynamic, evolving situations where initial information may be incomplete, patient conditions change rapidly, and new casualties arrive continuously, requiring sequential decisions under extreme time pressure with fluctuating resources. Our static scenario approach cannot capture these critical aspects—in actual emergencies, patients must be constantly re-triaged as circumstances evolve, and medical professionals balance current needs against anticipated future demands. Like other ethics benchmarks such as MACHIAVELLI, we are constrained by current LLM technical limitations, particularly context window restrictions that prevent modeling extended sequential decision-making scenarios. However, we believe our approach still provides valuable foundational insights, as we observed significant differences between models and no ceiling effects, suggesting TRIAGE captures meaningful aspects of ethical reasoning even in simplified contexts. Future work should build toward more realistic emergency simulations with interactive, sequential decision-making that incorporates uncertainty, time pressure, and evolving information—extending our methodological contribution of using established societal frameworks rather than researcher-created scenarios to more sophisticated benchmarking approaches. We emphasize that this work does not suggest that the LLMs included in this study could or should be used for triage decision making in real-world scenarios.

Experimental design

While our experimental design randomized the order of prompt presentation for each model, we acknowledge that residual order effects could influence model responses, as LLMs may be sensitive to the sequence in which different prompting conditions are encountered. Future work should systematically control for such effects, for example through fully counterbalanced designs, which could provide additional insights into the robustness of ethical decision-making patterns across presentation sequences.

Ethical theories

Our study focused on utilitarianism and deontology because they represent clearly contrasting approaches to high-stakes scenarios with limited resources. Utilitarianism aligns closely with the spirit of triage (maximizing overall benefit), while deontology emphasizes duty-based principles that may conflict with resource optimization. These frameworks provide distinctly different framing effects that could influence AI decision-making in emergency situations. Future work should explore how different ethical framings—including virtue ethics, care ethics, or principlism—might produce distinct biases or decision patterns in AI systems. Understanding these framing effects is crucial because the same clinical scenario could yield different AI responses depending on which ethical lens is applied. Testing additional ethical frameworks would help map the full range of framing effects that ethical prompting can induce in AI decision-making.

Cross cultural validity of triage

The triage framework itself is used across many cultures. Although no single system has been internationally adopted, triage is a globally applied principle, and triage guidelines are in use in many countries, from Korea and Singapore to Saudi Arabia and China12,13,14,15. Given that there are various equally good medical triage models7, we chose START and jumpSTART because they come with ready-made patient scenarios and solutions that are already used to train medical professionals. This gave us realistic test cases instead of artificial dilemmas created from scratch, which enhances the real-world relevance of our benchmark. We do not claim that this constitutes a comprehensive ethical standard, but it represents a rare case where clear, established standards guide the ethical decision-making of humans in high-stakes scenarios, making it a good benchmark for judging the ethical decision-making of LLMs. Nonetheless, our specific implementation is clearly Western-biased: our test questions and gold solutions were created by Western doctors, written in English, and based on the START/jumpSTART protocols developed in the US. Future work should create culturally adapted versions with scenarios developed by medical professionals from diverse backgrounds. Moreover, triage models are updated as medical knowledge evolves; our benchmark should evolve alongside these developments.

Human baselines

Unlike many AI benchmarks where human-level performance is the target, medical triage requires adherence to established protocols regardless of what average humans might do, and variability in human responses due to cultural differences or individual judgment doesn’t change the clinical gold standard. However, comparing LLM and human performance patterns would be valuable for future work, as understanding how AI systems differ from humans in their error patterns and decision-making under uncertainty could provide important insights for AI safety and deployment.

In conclusion, our work demonstrates that LLMs are capable of navigating complex ethical dilemmas in the medical domain. By incorporating real-world scenarios and requiring models to make explicit moral decisions, TRIAGE offers a more realistic alternative to other ME benchmarks. Further, our approach does not rely on potentially unreliable human or AI annotations. Our findings suggest that while proprietary models generally perform better, particularly by avoiding undercaring errors, this comes with the risk of over-calibration. We further see that reminding models of an ethical context can worsen their decision making in emergency situations. Although TRIAGE is limited to the medical field and does not include open-ended scenarios, it provides valuable insights into the ethical decision making of LLMs.
