starlee 2017-06-14 20:27:16 +08:00
parent 6db3608dea
commit b50a5f0a8a
3 changed files with 30 additions and 18 deletions


@ -156,25 +156,32 @@ and the subsequent pull-requests are referred to the \textit{duplicate pull-requ
\end{figure}
For example, Figure \ref{fig:example_dup_prs}
shows two duplicate pull-requests
both of which intend to \hl{[....]}
For example, Figure~\ref{fig:example_dup_prs}
shows a duplicate pull-request (\textit{Rails \#11869})
and its master pull-request (\textit{Rails \#11496}).
Both of them intend to resolve a problem with associations
that are based on a null relationship.
In \gh, in addition to commits (\ie file changes),
contributors also need to provide a summary title and a detailed description
to elaborate on the submitted pull-request.
%%%%
%%% Most of the time, the same problem is expressed in similar sentences, but the specific wording differs; the same root error may even lead to different failures
%%%
%%%%
However, GitHub does not provide an explicit way to
mark a pull-request as a duplicate of another one.
In our study, the test dataset of duplicates is identified by analysing review comments,
as elaborated in Section~\ref{sec:experiment}.
From the figure, we can see that the titles and descriptions
of these two pull-requests share some of the same words,
which means that natural language text can be used to measure their similarity.
Textual similarity has indeed been applied in many previous
studies~\cite{Runeson2007Detection,Wang2008,Nguyen2012Duplicate,Lazar2014Improving}
to detect duplicate content in software development (\eg bug reports).
Yet it is natural and common for different people
to describe the same thing in different words,
as reflected in the two duplicate pull-requests above,
which do not share many words.
However, compared with bug reports,
pull-requests contain more information, such as the diff of file changes (\ie commits).
Developers are likely to edit the same set of files
to fix the same bug or add the same feature.
Therefore, in addition to text information,
we also take diff information into consideration
and investigate combining the two to better detect duplicate pull-requests.
% \begin{itemize}
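% % Explain that the later a duplicate is identified, the more it hurts the contributor's continued contribution [whether to add this RQ depends on the actual experimental data; if it is added, this urgency must be mentioned in the intro before introducing automatic detection]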
@ -189,7 +196,6 @@ which is elaborated in Section~\ref{sec:experiment}
% Use that sequence diagram to explain [could go in the method section when describing the data collection process]
% !!!!!!!! RQ
% RQ0: time distribution
% RQ1: individual effect of Title, Desc, FileD, LineD


@ -10,7 +10,13 @@ We determine the similarity between pull-requests from two perspectives:
Text similarity is calculated based on the \hl{natural language text},
while diff similarity is calculated by comparing the file changes
contained in different pull-requests.
Finally, the combined similarity is used to retrieve potential target pull-requests.
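As a purely illustrative sketch (the symbols $\alpha$, $\mathit{sim}_{text}$ and $\mathit{sim}_{diff}$, as well as the weighted-sum form, are assumptions made here for exposition and not necessarily the formulation adopted in this work), the two perspectives could be combined as
\begin{equation}
\mathit{sim}(p_i, p_j) = \alpha \cdot \mathit{sim}_{text}(p_i, p_j) + (1 - \alpha) \cdot \mathit{sim}_{diff}(p_i, p_j), \quad 0 \le \alpha \le 1,
\end{equation}
where $p_i$ and $p_j$ denote two pull-requests.
One possible choice for the diff part, again only an assumption, is the overlap of changed files,
$\mathit{sim}_{diff}(p_i, p_j) = |F_i \cap F_j| / |F_i \cup F_j|$,
with $F_i$ denoting the set of files changed by $p_i$.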
[However, GitHub does not provide an explicit way to
mark a pull-request as a duplicate of another one.
In our study, the test dataset of duplicates is identified by analysing review comments,
as elaborated in Section~\ref{sec:experiment}.]
In the following sections, we elaborate on each step in detail.
\begin{figure}[ht]


@ -9,7 +9,7 @@
}
@inproceedings{Li2017,
title={Automatic Classification of Review Comments in Pull-based Development Model},
author={ Zhixing Li, Yue Yu, Gang Yin, Tao Wang, Qiang Fan and Huaimin Wang},
author={Li, Zhixing and Yu, Yue and Yin, Gang and Wang, Tao and Fan, Qiang and Wang, Huaimin},
booktitle={International Conference on Software Engineering and Knowledge Engineering},
year={2017},
}