Using large language models to evaluate biomedical query-focused summarisation

Hashem Hijazi, Diego Mollá*, Vincent Nguyen, Sarvnaz Karimi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

Biomedical question-answering systems remain popular with biomedical experts who interact with the literature to answer their medical questions. However, these systems are difficult to evaluate in the absence of costly human experts, so automatic evaluation metrics are often used in this space. Traditional automatic metrics such as ROUGE and BLEU, which rely on token overlap, have shown low correlation with human judgements. We present a study that uses large language models (LLMs) to automatically evaluate systems from BioASQ, an international challenge on biomedical semantic indexing and question answering. We measure the agreement of LLM-produced scores against human judgements. We show that with basic prompting techniques, LLMs correlate with humans at a level similar to lexical methods. However, by aggregating LLM evaluators or by fine-tuning, our methods outperform the baselines by a large margin, achieving Spearman correlations of 0.501 and 0.511, respectively.
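The agreement measure reported above is Spearman rank correlation between LLM-produced scores and human judgements. As an illustration only (the scores below are made up, not data from the paper), a minimal pure-Python sketch of that computation:

```python
# Illustrative sketch: measuring agreement between evaluator scores and
# human judgements with Spearman's rho (Pearson correlation of ranks).
# The score lists at the bottom are hypothetical, not from the paper.

def ranks(xs):
    """Return 1-based average ranks of xs, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-answer quality scores from humans and an LLM evaluator
human = [4, 2, 5, 3, 1]
llm = [3.5, 2.0, 4.8, 3.0, 1.2]
print(round(spearman(human, llm), 3))  # ranks agree exactly → 1.0
```

In practice one would use a library implementation such as `scipy.stats.spearmanr`; the hand-rolled version is shown only to make the rank-then-correlate computation explicit.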

Original language: English
Title of host publication: Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
Place of Publication: Stroudsburg
Publisher: Association for Computational Linguistics
Pages: 236-242
Number of pages: 7
ISBN (Electronic): 9798891761308
DOIs
Publication status: Published - 2024
Event: 23rd Meeting of the ACL Special Interest Group on Biomedical Natural Language Processing, BioNLP 2024 - Bangkok, Thailand
Duration: 16 Aug 2024 – 16 Aug 2024

Conference

Conference: 23rd Meeting of the ACL Special Interest Group on Biomedical Natural Language Processing, BioNLP 2024
Country/Territory: Thailand
City: Bangkok
Period: 16/08/24 – 16/08/24
