Database Credentialed Access

Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries

Stefan Hegselmann Shannon Shen Florian Gierse Monica Agrawal David Sontag Xiaoyi Jiang

Published: April 28, 2024. Version: 1.0.0


When using this resource, please cite:
Hegselmann, S., Shen, S., Gierse, F., Agrawal, M., Sontag, D., & Jiang, X. (2024). Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries (version 1.0.0). PhysioNet. https://doi.org/10.13026/a66y-aa53.

Additionally, please cite the original publication:

Hegselmann, S., Shen, S. Z., Gierse, F., Agrawal, M., Sontag, D., & Jiang, X. (2024). A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models. arXiv preprint arXiv:2402.15422.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Large language models in healthcare can generate informative patient summaries while reducing the documentation workload of healthcare professionals. However, these models are prone to producing hallucinations, that is, generating unsupported information, which is problematic in the sensitive healthcare domain. To better characterize unsupported facts in medical texts, we developed a rigorous labeling protocol. Following this protocol, two medical experts annotated unsupported facts in 100 doctor-written summaries from the MIMIC-IV-Note Discharge Instructions and hallucinations in 100 LLM-generated patient summaries. Here, we release two datasets based on these annotations: Hallucinations-MIMIC-DI and Hallucinations-Generated-DI. We find that using these datasets to train on hallucination-free examples effectively reduces hallucinations for both Llama 2 (2.60 to 1.55 hallucinations per summary) and GPT-4 (0.70 to 0.40). Furthermore, we created a preprocessed version of the MIMIC-IV-Note Discharge Instructions, releasing both a full-context version (MIMIC-IV-Note-Ext-DI) and a version that uses only the Brief Hospital Course as context (MIMIC-IV-Note-Ext-DI-BHC).


Background

Many patients do not understand the events that occurred during their hospitalization and the subsequent actions they need to take [1]. For instance, post-discharge interviews found that only 59.6% of patients could accurately describe their admission diagnosis and only 43.9% could fully describe their scheduled follow-up appointments [2]. Improved discharge communication is associated with lower hospital readmission rates and higher adherence to treatment regimens [3]. A potential intervention to improve patient understanding is patient-oriented summaries that describe all relevant facts in layperson language [4]. However, writing high-quality patient summaries is a difficult and time-consuming task [5], and healthcare workers already face high workloads [6,7].

Large language models (LLMs) have demonstrated strong capabilities on many natural language tasks, including medical summarization [8]. However, LLMs are prone to generating unsupported or erroneous facts, also referred to as hallucinations [9]. In healthcare, this issue is further aggravated by the fragmented nature of healthcare data: datasets often do not perfectly mimic the data available at the point of care. For example, datasets for medical summarization may not include the full patient history that accompanies the written summary, leading to "hallucinations" in the human-written reference. Training or fine-tuning on such data replicates these artifacts. Several techniques for preventing hallucinations have been studied [10]. However, hallucinations vary greatly in complexity, escaping automatic detection and making careful human annotation necessary [11]. This also applies to medical summaries [12].


Methods

MIMIC Datasets

First, we created a dataset of doctor-written patient summaries with different contexts that could be used to generate these summaries. We used the MIMIC-IV-Note v2.2 database which includes 331,793 deidentified free-text clinical notes from 145,915 patients admitted to the Beth Israel Deaconess Medical Center in Boston, MA, USA [13,14].

  1. Selecting Patient Summary: We used the Discharge Instructions section of the MIMIC-IV-Note discharge notes as patient summaries.
  2. Data preprocessing: Many summaries contained irrelevant artifacts that could distort downstream analysis. Hence, we designed a preprocessing pipeline that filtered out poor summaries and removed irrelevant content (see the paper for details). As a result, we kept 100,175 of the original 331,793 discharge notes.
  3. Selecting Context: We considered different contexts that serve as the source information for creating a summary. For our experiments, we chose only the Brief Hospital Course (BHC) section as context, since it contains the most relevant information about the hospital course written for medical professionals. We chose this shorter context to reduce the effort for the human annotators and to better fit the models' context windows. The resulting dataset is named MIMIC-IV-Note-Ext-DI-BHC. We also release a version with all notes prior to the Discharge Instructions as context, named MIMIC-IV-Note-Ext-DI.
  4. Selecting Subset for Annotation: To facilitate human annotation, we further filtered for contexts of at most 4,000 characters and summaries of at least 600 characters, yielding MIMIC-IV-Note-Ext-DI-BHC-Anno with 26,178 entries. This reduces the amount of context annotators must take into account and increases the information content of the summaries (a filtering sketch follows below).

As a result, we release three datasets of doctor-written patient summaries: one with the full context, one with the Brief Hospital Course as context, and a subset of the latter to facilitate human annotation: MIMIC-IV-Note-Ext-DI, MIMIC-IV-Note-Ext-DI-BHC, and MIMIC-IV-Note-Ext-DI-BHC-Anno.
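The length filter from step 4 can be reproduced directly from the released files. Below is a minimal sketch in Python, assuming the JSONL layout described under Data Description ("text" for the context, "summary" for the summary); the file names are illustrative.

    import json

    # Thresholds from step 4: contexts of at most 4,000 characters,
    # summaries of at least 600 characters.
    MAX_CONTEXT_CHARS = 4000
    MIN_SUMMARY_CHARS = 600

    def filter_for_annotation(in_path, out_path):
        """Keep only context-summary pairs suitable for human annotation."""
        kept = 0
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                pair = json.loads(line)
                if (len(pair["text"]) <= MAX_CONTEXT_CHARS
                        and len(pair["summary"]) >= MIN_SUMMARY_CHARS):
                    fout.write(json.dumps(pair) + "\n")
                    kept += 1
        return kept

    # Example (illustrative paths):
    # n = filter_for_annotation("mimic-iv-note-ext-di-bhc/dataset/all.json",
    #                           "filtered_4000_600_chars.json")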

Hallucination Datasets Annotated by Medical Experts

We developed a protocol for labeling token-level errors in medical texts based on [15,16], which is available on GitHub [17]. We distinguished unsupported, contradicted, and incorrect facts. Unsupported facts were further divided into nine subcategories. We treated the context (BHC) as the only ground truth about the patient. We chose this approach to reduce the labeling burden, as annotators could not be expected to review all notes and structured information for a patient. However, since patient summaries do not contain only patient-specific information, we allowed general medical knowledge and advice even if not explicitly provided in the context (e.g., "Please take your medications as prescribed").

The labeling was carried out by two German medical students in their sixth year who had completed their second state examination (USMLE Step 2 equivalent) and were working in the hospital. We used MedTator for annotation. For annotator training, we used twelve examples: two to familiarize the annotators with the task, followed by two batches of five examples that were labeled independently and then discussed. For the final labeling, the annotators worked independently and reached a consensus through discussion. More details can be found in [18].

  1. Annotating Unsupported Facts in Doctor-Written Patient Summaries: We selected 100 random examples from MIMIC-IV-Note-Ext-DI-BHC-Anno, and medical experts annotated unsupported facts in the patient summaries, yielding the dataset Hallucinations-MIMIC-DI. It is important to note that unsupported facts in doctor-written summaries are common in healthcare practice and usually should not be regarded as errors: doctors may include information that was never documented, that was documented outside the considered context (in our case, only the BHC), or that was altered just prior to discharge.
  2. Annotating Hallucinations in Generated Patient Summaries: We chose 20 held-out contexts from MIMIC-IV-Note-Ext-DI-BHC-Anno and used the five models trained for the data-centric hallucination reduction experiments to generate summaries. Again, medical experts annotated hallucinations in these summaries with our protocol, yielding Hallucinations-Generated-DI.

Derived Datasets from Hallucinations-MIMIC-DI

Based on Hallucinations-MIMIC-DI, we derived three additional datasets for our data-centric hallucination reduction experiments and qualitative evaluation: Original contains the same examples as Hallucinations-MIMIC-DI, Cleaned contains the examples with hallucinations manually removed or replaced, and Cleaned & Improved contains the examples with further mistakes and artifacts corrected.


Data Description

MIMIC Datasets

The datasets are provided as JSONL files with one context-summary pair per line, each stored as a JSON dictionary with the key "text" for the context and "summary" for the summary (a loading sketch follows the list below).

  • MIMIC-IV-Note-Ext-DI (/mimic-iv-note-ext-di/dataset/all.json): 100,175 context-summary pairs filtered and preprocessed from MIMIC-IV-Note (see the paper for additional details). The context contains all text preceding the Discharge Instructions section, which was used as the patient summary.
  • MIMIC-IV-Note-Ext-DI-BHC (/mimic-iv-note-ext-di-bhc/dataset/all.json): 100,175 context-summary pairs from MIMIC-IV-Note-Ext-DI with shorter context (Brief Hospital Course).
  • MIMIC-IV-Note-Ext-DI-BHC-Anno (/mimic-iv-note-ext-di-bhc/dataset/*_4000_600_chars.json): 26,178 context-summary pairs, which are a subset of MIMIC-IV-Note-Ext-DI-BHC with contexts ≤ 4,000 characters and summaries ≥ 600 characters to facilitate human annotation.
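For orientation, a minimal Python sketch for loading one of these files (the path follows the listing above; no external libraries needed):

    import json

    def load_pairs(path):
        """Read a JSONL file of context-summary pairs into a list of dicts."""
        with open(path) as f:
            return [json.loads(line) for line in f]

    pairs = load_pairs("mimic-iv-note-ext-di-bhc/dataset/all.json")
    print(len(pairs))                 # 100,175 pairs expected
    print(pairs[0]["text"][:200])     # start of the context (Brief Hospital Course)
    print(pairs[0]["summary"][:200])  # start of the patient summary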

Hallucination Datasets Annotated by Medical Experts

The datasets use the same JSONL format as the MIMIC datasets (entries for "text" and "summary") and contain an additional entry "labels" with the agreed-upon hallucination annotations. Each annotation contains the "start" and "end" characters of the span, its "length" in characters, and the annotated "text". A "label" entry gives one of the eleven labels introduced in our annotation protocol. We also provide the annotations as XML files in the BioC format. The datasets are in /hallucination_datasets (a parsing sketch follows the list below).

  • Hallucinations-MIMIC-DI: 100 random context-summary pairs from MIMIC-IV-Note-Ext-DI-BHC-Anno with unsupported facts annotated and agreed upon by two medical experts.
  • Hallucinations-MIMIC-DI-Valid: 10 random validation context-summary pairs from MIMIC-IV-Note-Ext-DI-BHC-Anno with hallucinations annotated and agreed upon by two medical experts.
  • Hallucinations-Generated-DI: 100 context-summary pairs based on 20 random contexts from MIMIC-IV-Note-Ext-DI-BHC-Anno and summaries generated with five different models.
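A minimal sketch for reading the span annotations, assuming the fields described above ("start", "end", "length", "text", "label") and that the offsets index into the summary string; the file name is illustrative:

    import json
    from collections import Counter

    def load_annotated(path):
        with open(path) as f:
            return [json.loads(line) for line in f]

    # Illustrative file name inside /hallucination_datasets.
    examples = load_annotated("hallucination_datasets/hallucinations-mimic-di.json")

    label_counts = Counter()
    for example in examples:
        for ann in example["labels"]:
            # Sanity check: the span recovered via the offsets should match
            # the stored annotation text (assuming offsets index the summary).
            span = example["summary"][ann["start"]:ann["end"]]
            assert span == ann["text"], (span, ann["text"])
            label_counts[ann["label"]] += 1

    print(label_counts.most_common())  # distribution over the eleven labels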

Derived Datasets from Hallucinations-MIMIC-DI

The datasets are provided as JSONL files with one context-summary pair per line as JSON dictionary with "text" as key for the context and "summary" for the summary. The datasets are in: /derived_datasets.

  • Original: 100 context-summary pairs from Hallucinations-MIMIC-DI.
  • Cleaned: 100 context-summary pairs from Original with labeled unsupported facts manually removed or replaced.
  • Cleaned & Improved: 100 context-summary pairs from Cleaned with mistakes and artifacts removed or corrected.

Usage Notes

The experiments carried out in the corresponding paper serve as example usage in Python; the code is available on GitHub [17]. Common use cases of the published data are listed below:

  • MIMIC Datasets: The MIMIC datasets contain a preprocessed and cleaned version of the Discharge Instructions; hence, they can be a useful starting point for machine learning experiments working with this section of the MIMIC-IV-Note discharge notes. We provide versions with two different contexts.

  • Hallucination Datasets Annotated by Medical Experts: The datasets contain labels for unsupported facts in 100 doctor-written and hallucinations in 100 generated patient summaries. They can be used to evaluate automatic hallucination detection methods and to train automatic approaches for hallucination reduction.

  • Derived Datasets from Hallucinations-MIMIC-DI: The derived datasets contain the 100 doctor-written summaries with unsupported facts removed (Cleaned) and, additionally, with improved language (Cleaned & Improved). They can be used to fine-tune or prompt an LLM with high-quality examples to generate patient summaries with fewer hallucinations and higher quality (see the sketch below). The data can also serve as a higher-quality reference for evaluating patient summary generation.
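As one concrete use, below is a minimal sketch of few-shot prompting with the Cleaned & Improved examples; the file name and prompt wording are illustrative and not the exact setup from the paper.

    import json

    def load_pairs(path):
        with open(path) as f:
            return [json.loads(line) for line in f]

    # Illustrative file name inside /derived_datasets.
    shots = load_pairs("derived_datasets/cleaned_improved.json")

    def build_prompt(context, shots, k=2):
        """Prepend k hallucination-free examples before the new context."""
        parts = ["Write a patient summary for the given Brief Hospital Course."]
        for shot in shots[:k]:
            parts.append("Brief Hospital Course:\n" + shot["text"])
            parts.append("Patient summary:\n" + shot["summary"])
        parts.append("Brief Hospital Course:\n" + context)
        parts.append("Patient summary:")
        return "\n\n".join(parts)

    new_context = "..."  # a held-out Brief Hospital Course
    prompt = build_prompt(new_context, shots)
    # Pass `prompt` to an LLM of choice to generate a summary.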


Release Notes

Version 1.0.0

First publicly available version of the data that was used in the original paper.


Ethics

The authors declare no ethics concerns.


Acknowledgements

The generated patient summaries were computed on the HPC cluster PALMA II of the University of Münster, subsidised by the DFG (INST 211/667-1).


Conflicts of Interest

None to declare.


References

  1. Kebede, S., Shihab, H. M., Berger, Z. D., Shah, N. G., Yeh, H. C., & Brotman, D. J. (2014). Patients’ understanding of their hospitalizations and association with satisfaction. JAMA internal medicine, 174(10), 1698-1700.
  2. Horwitz, L. I., Moriarty, J. P., Chen, C., Fogerty, R. L., Brewster, U. C., Kanade, S., ... & Krumholz, H. M. (2013). Quality of discharge practices and patient understanding at an academic medical center. JAMA internal medicine, 173(18), 1715-1722.
  3. Becker, C., Zumbrunn, S., Beck, K., Vincent, A., Loretz, N., Müller, J., ... & Hunziker, S. (2021). Interventions to improve communication at hospital discharge and rates of readmission: a systematic review and meta-analysis. JAMA Network Open, 4(8), e2119346-e2119346.
  4. Federman, A., Sarzynski, E., Brach, C., Francaviglia, P., Jacques, J., Jandorf, L., ... & Kannry, J. (2018). Challenges optimizing the after visit summary. International journal of medical informatics, 120, 14-19.
  5. Mueller, S. K., Giannelli, K., Boxer, R., & Schnipper, J. L. (2015). Readability of patient discharge instructions with and without the use of electronically available disease-specific templates. Journal of the American Medical Informatics Association, 22(4), 857-863.
  6. Phillips, C. (2020). Relationships between workload perception, burnout, and intent to leave among medical–surgical nurses. JBI Evidence Implementation, 18(2), 265-273.
  7. Watson, A. G., McCoy, J. V., Mathew, J., Gundersen, D. A., & Eisenstein, R. M. (2019). Impact of physician workload on burnout in the emergency department. Psychology, health & medicine, 24(4), 414-428.
  8. Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J. B., Aali, A., Bluethgen, C., ... & Chaudhari, A. S. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine, 1-9.
  9. Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661.
  10. Huang, Y., Feng, X., Feng, X., & Qin, B. (2021). The factual inconsistency problem in abstractive text summarization: A survey. arXiv preprint arXiv:2104.14839.
  11. Thomson, C., Reiter, E., & Sundararajan, B. (2023). Evaluating factual accuracy in complex data-to-text. Computer Speech & Language, 80, 101482.
  12. Moramarco, F., Korfiatis, A. P., Perera, M., Juric, D., Flann, J., Reiter, E., ... & Savkov, A. (2022). Human evaluation and correlation with automatic metrics in consultation note generation. arXiv preprint arXiv:2204.00447.
  13. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. https://doi.org/10.13026/1n74-ne17.
  14. Goldberger, A. L., Amaral, L. A., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23), e215-e220.
  15. Thomson, C., & Reiter, E. (2020). A gold standard methodology for evaluating accuracy in data-to-text systems. arXiv preprint arXiv:2011.03992.
  16. Thomson, C., & Reiter, E. (2021). Generation challenges: Results of the accuracy evaluation shared task. arXiv preprint arXiv:2108.05644.
  17. Code for "A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models". Available from: https://github.com/stefanhgm/patient_summaries_with_llms [Accessed 21 April 2024].
  18. Hegselmann, S., Shen, S. Z., Gierse, F., Agrawal, M., Sontag, D., & Jiang, X. (2024). A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models. arXiv preprint arXiv:2402.15422.

Parent Projects
Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries was derived from MIMIC-IV-Note [13]. Please cite it when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
