Comparing rating scales and preference judgements in language evaluation

Belz, Anja and Kow, Eric (2010) Comparing rating scales and preference judgements in language evaluation. In: Proceedings of the 6th International Natural Language Generation Conference (INLG'10), July 7-9, 2010, Dublin, Ireland.

Full text not available from this repository.

Official URL: http://dl.acm.org/citation.cfm?id=1873743

Abstract

Rating-scale evaluations are common in NLP, but are problematic for a range of reasons, e.g. they can be unintuitive for evaluators, inter-evaluator agreement and self-consistency tend to be low, and the parametric statistics commonly applied to the results are not generally considered appropriate for ordinal data. In this paper, we compare rating scales with an alternative evaluation paradigm, preference-strength judgement experiments (PJEs), where evaluators have the simpler task of deciding which of two texts is better in terms of a given quality criterion. We present three pairs of evaluation experiments assessing text fluency and clarity for different data sets, where one of each pair of experiments is a rating-scale experiment, and the other is a PJE. We find the PJE versions of the experiments have better evaluator self-consistency and inter-evaluator agreement, and a larger proportion of variation accounted for by system differences, resulting in a larger number of significant differences being found.
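To make the contrast between the two paradigms concrete, the sketch below (not taken from the paper; the data, scale, and choice of statistical tests are illustrative assumptions) shows how ordinal rating-scale scores might be compared with a non-parametric test, versus how pairwise preference judgements might be analysed with a simple sign/binomial test.

```python
# Illustrative sketch only: hypothetical data and analysis choices,
# not the experimental setup or statistics used in the paper.
from scipy.stats import mannwhitneyu, binomtest

# Rating-scale paradigm: each evaluator scores each system's output on an
# ordinal scale (e.g. 1-7 for fluency). Because the data are ordinal, a
# non-parametric test is used here rather than a parametric t-test.
ratings_system_a = [5, 6, 4, 7, 5, 6, 5]
ratings_system_b = [4, 5, 4, 5, 6, 4, 5]
u_stat, p_ratings = mannwhitneyu(ratings_system_a, ratings_system_b,
                                 alternative="two-sided")

# Preference-judgement paradigm: each evaluator sees a pair of outputs
# (one from each system) and decides which is better. A binomial (sign)
# test asks whether system A is preferred more often than chance.
preferences = ["A", "A", "B", "A", "A", "B", "A"]  # one vote per pair
wins_a = preferences.count("A")
p_preferences = binomtest(wins_a, n=len(preferences), p=0.5).pvalue

print(f"Rating-scale comparison: U={u_stat:.1f}, p={p_ratings:.3f}")
print(f"Preference comparison:   {wins_a}/{len(preferences)} prefer A, "
      f"p={p_preferences:.3f}")
```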

Item Type: Contribution to conference proceedings in the public domain (Full Paper)
Subjects: G000 Computing and Mathematical Sciences
DOI (a stable link to the resource): 10.1.1.167.7542
Faculties: Faculty of Science and Engineering > School of Computing, Engineering and Mathematics > Natural Language Technology
ID Code: 8059
Deposited By: Converis
Deposited On: 07 Jan 2011 12:19
Last Modified: 07 Feb 2013 03:04
