TY - GEN
T1 - Evaluating dialogue generation systems via response selection
AU - Sato, Shiki
AU - Akama, Reina
AU - Ouchi, Hiroki
AU - Suzuki, Jun
AU - Inui, Kentaro
N1 - Funding Information:
This work was partially supported by JSPS KAKENHI Grant Number JP19H04162. We would like to thank the laboratory members who gave us advice and all reviewers of this work for their insightful comments.
Publisher Copyright:
© 2020 Association for Computational Linguistics
PY - 2020
Y1 - 2020
N2 - Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method to construct response selection test sets with well-chosen false candidates. Specifically, we propose to construct test sets by filtering out two types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with the test set developed by our method correlates more strongly with human evaluation than widely used automatic evaluation metrics such as BLEU.
AB - Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method to construct response selection test sets with well-chosen false candidates. Specifically, we propose to construct test sets by filtering out two types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with the test set developed by our method correlates more strongly with human evaluation than widely used automatic evaluation metrics such as BLEU.
UR - http://www.scopus.com/inward/record.url?scp=85097342836&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097342836&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85097342836
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 593
EP - 599
BT - ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Y2 - 5 July 2020 through 10 July 2020
ER -