CMU-CS-24-155
Computer Science Department
School of Computer Science, Carnegie Mellon University



Analyzing Multimodal Machine Learning Model Performance
and Evaluation Metrics for Medical Report Generation

Ankit Gupta

M.S. Thesis

December 2024


Keywords: Multimodal Learning, Vision-Language Models, Medical Report Generation

As a result of recent advancements in foundation models, including large vision-language models, several researchers have explored methods of combining multiple modalities of data as inputs for visual question answering. One key application of visual question answering in healthcare is automated medical report generation, where chest X-ray images and text-based symptom data for a patient might be provided as inputs, with the intention of generating a relevant medical report as an output. However, very few studies analyze the performance of these models alongside unimodal fine-tuned LLMs, and even fewer compare the performance of these multimodal models depending on whether they are provided symptom information as an input. Furthermore, past studies often use simple evaluation metrics based on n-gram overlap, such as BLEU and ROUGE scores, which are poorly suited to generative foundation models that can express the same semantic meaning in different sentences.
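The limitation of n-gram overlap noted above can be illustrated with a minimal sketch. The function below computes a simplified clipped bigram precision, which is only one ingredient of BLEU and not the full BLEU or ROUGE definition. The helper names and the example sentences are hypothetical and are not taken from the thesis or its dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n=2):
    """Simplified clipped n-gram precision (an assumption, not full BLEU):
    the fraction of candidate n-grams that also occur in the reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical sentences, invented for illustration only.
reference = "no evidence of pneumonia"
paraphrase = "lungs are clear without infection"
verbatim = "no evidence of pneumonia"

paraphrase_score = ngram_precision(paraphrase, reference)
verbatim_score = ngram_precision(verbatim, reference)
```

In this sketch the paraphrase shares no bigrams with the reference and therefore scores zero, even though a reader might judge the two findings to be close in meaning, while the verbatim copy scores one.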

In this paper, we present two main contributions. First, we compare the performance of a variety of approaches for generating medical reports on a dataset of chest X-ray medical reports, including a unimodal fine-tuned medical LLM, a multimodal model without symptom data, and a multimodal model with symptom data. Second, we introduce four new metrics for evaluating the similarity between generated and reference medical reports, which we term Word Pairs, Sentence Average, Sentence Pairs, and Sentence Pairs (Bio). Our results show that multimodal approaches to medical report generation far outperform unimodal approaches, and that providing symptom data slightly improves the accuracy of generated medical reports. We also find that our newly introduced Sentence Pairs evaluation metric more closely measures similarity between generated and reference medical reports than all prior metrics, as evidenced by thorough quantitative and qualitative case study comparisons.

This research advances medical report generation by reinforcing the accuracy benefits of multimodal models with symptom inputs and by introducing several more comprehensive, customized scoring metrics for evaluating generated medical reports.

107 pages

Thesis Committee:
Min Xu (Chair)
Martin Zhang
Bryan Wilder

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

