\section{Introduction}
% background
A critical factor in the rapid development and evolution of
many open source software projects on GitHub
is the use of the pull-based development model~\cite{Gousios:2014,Gousios:2014b,Gousios:2016,Yu:2016}.
This model allows any developer
(a core member of a project or an external developer)
to contribute to a public project by submitting \textit{pull-requests}.
% pr
Contributors fork (\ie clone) a repository of interest
and make their changes locally on the clone
without disturbing the original repository~\cite{Gousios:2014,Yu:2016}.
On the basis of the cloned repository, contributors can
fix bugs, add new features, improve documentation, \etc.
When their changes are ready to be merged back into the main repository,
they create a pull-request to notify the core team of the project
to review the submitted changes~\cite{Tsay:2014a,Rigby2015A,yu2014reviewer},
which is an important software quality assurance activity.
% [Also, the cost of users searching the repository (to determine if their problem has
% been reported) is higher than the cost of creating a new bug report.]
% problem

%!!!!!!!!!
% 2017-06-26 PM 5:15
% Work practices and challenges in pull-based development: The contributor's perspective
% RQ1.1 of that paper covers duplicate pull-requests
% we can argue that, with our approach, contributors no longer need to spend a lot of time checking for duplicates;
% as in the work on predicting comment usefulness, the prediction can be made at submission time
% a contributor could first submit a pull-request and then use its text (or list the files to be changed) to search for duplicates automatically through our approach (an online service)
%!!!!!!!!!
Contributing in the pull-based model is a parallel and uncoordinated process~\cite{Barr:2012,Gousios:2014b,Gousios:2016}.
Therefore, duplicate pull-requests may be created
by different developers to address exactly the same problem,
especially in popular projects
that attract numerous contributors and receive many pull-requests every day.
% importance
% harm
Duplicate pull-requests increase the maintenance cost of GitHub projects
and waste the time that reviewers spend on
the redundant effort of reviewing each of them separately.
%%% the SEKE paper could be cited here
In the review life-cycle of a pull-request~\cite{Li2017},
\ie the period from its submission
to its closure,
a duplicate can be identified at any stage.
% ideally add some example comments that express contributors' feelings
The later a pull-request is recognized as a duplicate of another one,
the more effort is wasted.
% for contributors
Furthermore, contributors often continuously improve their pull-requests
in response to code review feedback~\cite{yu2014reviewer,Li2017}.
Late identification of duplication therefore tends to
leave contributors more frustrated~\cite{Huang2016Effectiveness} and
more doubtful about the management team,
because they have already invested considerable effort
in several rounds of contribution improvement and code review,
especially if their pull-requests are treated as duplicates of
ones created later.
% current practice

The current practice is to rely on code reviewers
to identify duplicate pull-requests manually.
Unfortunately, the number of pull-requests submitted daily
can be too large for the reviewers of popular projects to cope with
% not only new pull-requests but also active ones count; emphasize the reviewers' heavy workload
(\eg every day, almost **** new pull-requests need to be handled)~\cite{Gousios:2014,yu2014reviewer}.
Moreover, it is unrealistic for reviewers to keep all historical pull-requests
in mind and compare each of them with a newly submitted one.
As a result, many duplicate pull-requests cannot be identified in time.
Although much effort has been spent on
the evaluation of pull-requests~\cite{Tsay:2014b,Tsay:2014a,Yu:2015,Thongtanunam:2015,Baysal:2015,jiang:2015},
very little research has been conducted to assist pull-request management.
This highlights the need for an automated tool
that can detect duplicate pull-requests at an early stage.
First,
automatic detection of duplicates can assist reviewers
and spare them redundant work.
Second,
detecting duplicates early can connect the authors of the separate pull-requests in a timely manner,
so that they can coordinate and work together on one pull-request rather than
working independently on redundant changes.
Moreover,
an interest group can be built among developers who submit duplicate contributions,
since they may share the same focus and
care about the same modules or characteristics of a project.
% benefit
% !!!!
% reduce the reviewers' workload
% coordinate developers better
% build an `interest group'
% !!!!

% !!!!!! this paper has data on software maintenance cost
% An Analytics-Driven Approach to Identify Duplicate Bug Records in Large Data Repositories
% !!!!!! this paper has data on software maintenance cost

% [Comparing a new report to already existing reports, in hope of finding a duplicate, is a tedious and error prone process.
% The current search engine in the DMS is a basic string matcher to which you can pass additional arguments such as time interval and DR id interval
% ]

% solution
In this paper,
we propose an approach that uses natural language text and diff information
to automatically detect duplicate pull-requests on GitHub.
The natural language text consists of the title and description of a pull-request,
and the diff information describes the file changes that the pull-request introduces.
When a new pull-request arrives,
our method computes the textual similarity
and the diff similarity between it and the existing pull-requests,
and then returns a candidate list of the most similar ones.
% results
We evaluate our approach in terms of recall-rate
on a test dataset of duplicate pull-requests
that we collected from three popular projects hosted on GitHub,
namely Rails, Elasticsearch and Angular.JS.
% should we mention the value of k?
The evaluation shows that about **\% - **\% of the duplicates can be found
when we combine textual similarity and line-level diff similarity,
compared to **\% - **\% using only natural language text
and **\% - **\% using only diff information.
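% Sketch under stated assumptions: this is the definition of recall-rate commonly used in
% duplicate detection studies; the symbols k, N_detected and N_total are illustrative names.
As a sketch, assuming the definition commonly used in duplicate detection studies,
the recall-rate for a candidate list of size $k$ can be written as
\begin{equation}
\textit{recall-rate@}k = \frac{N_{\mathit{detected}}}{N_{\mathit{total}}},
\end{equation}
where $N_{\mathit{total}}$ is the number of duplicate pull-requests in the test dataset
and $N_{\mathit{detected}}$ is the number of those whose counterpart appears among the top $k$ candidates.
For instance, if the counterpart is listed for 3 out of 4 duplicates, the recall-rate is $3/4 = 0.75$.
The symbols are illustrative and are not taken from the evaluation itself.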
% contributions

To the best of our knowledge, we are the first to investigate
how to automatically detect duplicate pull-requests on GitHub.
The key contributions of this study are as follows:

\begin{itemize}
\item
We formulate the problem of detecting duplicate pull-requests on GitHub
and construct a dataset of duplicate pull-requests
through automatic identification and manual examination.

\item
We propose an approach that uses natural language text and diff information
contained in code changes to detect duplicate pull-requests.

\item
An evaluation on ** popular projects hosted on GitHub
shows that our method can effectively detect **\% - **\% of the duplicates,
which suggests that it has practical value and can be integrated into issue tracking systems.

\end{itemize}
% paper organization

The rest of the paper is organized as follows.
Section~\ref{sec:bg} introduces the background.
Section~\ref{sec:approach} presents our approach in detail, and
Section~\ref{sec:experiment} describes the experiments and reports the results.
% Section~\ref{sec:Pattern} depicts a preliminary but typical analysis results mined from review comments.
Threats to validity and related work are discussed in Section~\ref{sec:Threats} and Section~\ref{sec:RelatedW}, respectively.
Finally, we conclude in Section~\ref{sec:Concl}.
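% summary of the problems caused
% types of duplicates and the problems they cause
% 2.1 B is identified as a duplicate of A right after it is submitted
% the contributor wastes contribution time; the reviewer wastes time on the judgment
% 2.2 B is identified as a duplicate of A during the review process after it is submitted
% the contributor wastes contribution time; the reviewer wastes review time
% 2.3 A is under review after submission; B is submitted, reviewed and accepted; finally A is found to be a duplicate of B
% the contributor wastes contribution time; the reviewer wastes review time; A is very likely to be deeply hurt
% and to doubt the ability of the core team
% for reviewers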