On the Vocabulary Agreement in Software Issue Descriptions

Oscar Chaparro, Juan Manuel Florez, and Andrian Marcus

Proceedings of the 32nd IEEE International Conference on Software Maintenance and Evolution (ICSME'16), ERA track, pp. 448–452, 2016


Abstract: Many software comprehension tasks depend on how stakeholders textually describe their problems. These textual descriptions are leveraged by Text Retrieval (TR)-based solutions to more than 20 software engineering tasks, such as duplicate issue detection. The common assumption of such methods is that text describing the same issue in multiple places will have a common vocabulary. This paper presents an empirical study aimed at verifying this assumption and discusses the impact of the common vocabulary on duplicate issue detection. The study investigated 13K+ pairs of duplicate bug reports and Stack Overflow (SO) questions. We found that on average, more than 12.2% of the duplicate pairs do not have common terms. The other duplicate issue descriptions share, on average, 30% of their vocabulary. The good news is that these duplicates have significantly more terms in common than the non-duplicates. We also found that the difference between the lexical agreement of duplicate and non-duplicate pairs is a good predictor for the performance of TR-based duplicate detection.
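
To make the notion of shared vocabulary concrete, the following is a minimal sketch of how the lexical agreement between two issue descriptions might be measured. It is an illustration under stated assumptions, not the study's actual procedure: the tokenizer (lowercased alphanumeric runs, no stemming or stop-word removal), the Jaccard-style ratio, the function names `vocabulary` and `lexical_agreement`, and the sample reports are all hypothetical.

```python
import re


def vocabulary(text):
    """Return the set of lowercased alphanumeric terms in a description.

    Assumption: no stemming or stop-word removal is applied here.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_agreement(text_a, text_b):
    """Share of the combined vocabulary that both descriptions use.

    Assumption: agreement is the Jaccard ratio |A & B| / |A | B|.
    """
    vocab_a = vocabulary(text_a)
    vocab_b = vocabulary(text_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


# Hypothetical descriptions of the same problem, written by different people.
report_a = "App crashes when saving a file"
report_b = "Crash on file save"
report_c = "Program terminates unexpectedly during export"

print(lexical_agreement(report_a, report_b))  # only "file" is shared
print(lexical_agreement(report_a, report_c))  # no terms in common
```

Without stemming, "crashes" and "crash" count as different terms, so the first pair shares only "file"; the second pair shares nothing, which is the situation that defeats a purely term-based retrieval approach.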