Using Bug Descriptions to Reformulate Queries during Text-Retrieval-based Bug Localization

  Oscar Chaparro, Juan Manuel Florez, and Andrian Marcus

  Empirical Software Engineering (EMSE)

Abstract: Text Retrieval (TR)-based approaches for bug localization rely on formulating an initial query based on the full text of a bug report. When the query fails to retrieve the buggy code artifacts, developers can reformulate the query and retrieve more candidate code documents. Existing research on query reformulation focuses mostly on leveraging relevance feedback from the user or on expanding the original query with additional information. We hypothesize that the title of the bug reports, the observed behavior, expected behavior, steps to reproduce, and code snippets provided by the users in bug descriptions, contain the most relevant information for retrieving the buggy code artifacts, and that other parts of the descriptions contain more irrelevant terms, which hinder retrieval. This paper proposes and evaluates a set of query reformulation strategies based on the selection of existing information in bug descriptions, and the removal of irrelevant parts from the original query. The results show that selecting the bug report title and the observed behavior is the strategy that performs best across various TR-based bug localization approaches and code granularities, as it leads to retrieving the buggy code artifacts within the top-N results for 25.6% more queries (on average) than without query reformulation. This strategy is highly applicable and consistent across different thresholds N. Selecting the steps to reproduce or the expected behavior (when provided in the bug reports) along with the bug title and the observed behavior leads to higher performance (i.e., between 31.4% and 41.7% more queries) and comparable consistency, yet it is applicable in fewer cases. These reformulation strategies are easy to use and are independent of the underlying retrieval technique.