An investigation into the validity of some metrics for automatically evaluating NLG systems

Belz, Anja and Reiter, Ehud (2009) An investigation into the validity of some metrics for automatically evaluating NLG systems. Computational Linguistics, 35 (4). pp. 529-558. ISSN 0891-2017

Full text not available from this repository.

Official URL: http://www.mitpressjournals.org/doi/abs/10.1162/co...

Abstract

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.

Item Type: Article
Subjects: G000 Computing and Mathematical Sciences > G400 Computing
Identification Number: 10.1162/coli.2009.35.4.35405
Faculties: Faculty of Science and Engineering > School of Computing, Engineering and Mathematics > Natural Language Technology
ID Code: 7009
Deposited By: editor cmis
Deposited On: 12 Mar 2010
Last Modified: 20 Apr 2012 10:05
