Automatic Dialogue Summarization Evaluation: Combining Human Judgements with ROUGE and BERT Score
Abstract
Dialogue summarization remains a challenging task because conversations are ambiguous and informal. This work examines how well automatic metrics agree with human evaluation scores for thirty dialogue summaries, in order to determine whether these metrics reflect human perception of quality. Human evaluators rated each summary on accuracy, conciseness, meaning preservation, and grammar using a 5-point scale. These ratings were compared with ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore. Results show a moderate alignment between human ratings and BERTScore, while the ROUGE metrics show weaker correlations. These findings suggest that reference-based metrics capture semantic alignment more effectively than they capture structural quality. Recommendations for improving summarization include incorporating factual consistency and compression-aware training. The study highlights the continued importance of human evaluation in conversational summarization research.
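The comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a simplified whitespace-tokenized ROUGE-1 F1 and a Pearson correlation, since the abstract does not state which ROUGE implementation or correlation coefficient was used. The function names `rouge1_f1` and `pearson`, the example summaries, and the human ratings are all hypothetical.

```python
from collections import Counter
from math import sqrt


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; returns NaN if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return cov / (norm_x * norm_y)


# Hypothetical (reference summary, system summary, mean human rating on a 1-5 scale).
examples = [
    ("anna will bring cake to the party", "anna will bring cake", 4.5),
    ("tom is late because of traffic", "tom missed the bus", 2.0),
    ("they agreed to meet at noon", "they will meet at noon", 4.0),
]

metric_scores = [rouge1_f1(ref, cand) for ref, cand, _ in examples]
human_scores = [rating for _, _, rating in examples]
print("ROUGE-1 F1 per summary:", [round(s, 3) for s in metric_scores])
print("Pearson r (metric vs. human):", round(pearson(metric_scores, human_scores), 3))
```

The same correlation step would apply to any other metric, such as BERTScore, by replacing the metric scores with that metric's per-summary values.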
References
Li, M., Zhang, L., Ji, H., & Radke, R. J. (2019, July). Keep meeting summaries on topic: Abstractive multi-modal meeting summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 2190-2196).
Zhang, A. X., & Cranshaw, J. (2018). Making sense of group chat through collaborative tagging and summarization. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 1-27.
Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G. A., & Reynar, J. (2008, April). Building a sentiment summarizer for local service reviews. In WWW workshop on NLP in the information explosion era (Vol. 14, pp. 339-348).
Di Sorbo, A., Panichella, S., Alexandru, C. V., Visaggio, C. A., & Canfora, G. (2017, May). SURF: Summarizer of user reviews feedback. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C) (pp. 55-58). IEEE.
Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74-81).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Gliwa, B., Mochol, I., Biesek, M., & Wawer, A. (2019). SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237.
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
Feng, X., Feng, X., & Qin, B. (2021). A survey on dialogue summarization: Recent advances and new frontiers. arXiv preprint arXiv:2107.03175.