

\section{Introduction}
% background
A critical factor in the rapid development and evolution of
many open source software projects in GitHub
is the use of the pull-based development model~\cite{Gousios:2014,Gousios:2014b,Gousios:2016,Yu:2016};
this model allows any developer
(a core member of a project or an external developer)
to contribute to a public project by submitting \textit{pull-requests}.
% pr
Contributors fork (\ie clone) a repository that interests them
and make their changes locally on the clone
without disturbing the original repository~\cite{Gousios:2014,Yu:2016}.
On the basis of the cloned repository, contributors can
fix bugs, add new features, improve documentation, \etc.
When their changes are ready to be merged back into the main repository,
they create a pull-request to notify the core team of the project
to review the submitted changes~\cite{Tsay:2014a,Rigby2015A,yu2014reviewer},
which is an important software quality assurance practice.


% [Also, the cost of users searching the repository (to determine if their problem has
%  been reported) is higher than the cost of creating a new bug report.]
% problem

%！！！！！！！！！
% 2017-06-26 PM 5:15
% Work practices and challenges in pull-based development: The contributor’s perspective
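% RQ1.1 has content related to duplicate PRs
% We can say that, with our method, contributors no longer need to spend a lot of time checking;
% as in the work on predicting comment usefulness, the prediction can be made at submission time
% A contributor can first submit a PR and then use its text (or list the files to be changed) to search for duplicates automatically through our method (an online service)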
%！！！！！！！！！

Contributing in the pull-based model is a parallel and uncoordinated process~\cite{Barr:2012,Gousios:2014b,Gousios:2016}.
Therefore, duplicate pull-requests may be created
by different developers to address exactly the same problem,
especially in popular projects,
which attract numerous contributors and receive many pull-requests every day.
% importance
% harm
Duplicate pull-requests increase the maintenance cost of projects in GitHub
and waste reviewers' time on
the redundant effort of reviewing each of them separately.
%%% the SEKE paper could be cited here
During the review life-cycle of a pull-request~\cite{Li2017},
\ie the period from its submission
to its closure,
a duplicate can be identified at any stage.
% ideally, include some example comments that express how contributors feel
The later a pull-request is recognized as a duplicate of another one,
the more effort is wasted.
% for contributors
Furthermore, contributors often continuously improve their pull-requests
in response to code review feedback~\cite{yu2014reviewer,Li2017},
and therefore late identification of duplication tends to
leave contributors more frustrated~\cite{Huang2016Effectiveness} and
doubtful about the management team
after they have invested considerable effort
in several rounds of contribution improvement and code review,
especially if their pull-requests are treated as duplicates of
ones created later.

The current practice is to rely on the code reviewers
to identify these duplicate pull-requests manually.
Unfortunately, the number of pull-requests submitted daily
can be too large for the reviewers of popular projects to cope with
% not only new PRs but also active PRs count; emphasize the heavy workload of reviewers
(\eg almost **** new pull-requests need to be handled every day)~\cite{Gousios:2014,yu2014reviewer}.
Moreover, it is not realistic for reviewers to keep all the historical pull-requests
in mind and compare each of them with a newly submitted one.
As a result, many duplicate pull-requests cannot be identified in time.
Although much effort has been devoted to
the evaluation of pull-requests~\cite{Tsay:2014b,Tsay:2014a,Yu:2015,Thongtanunam:2015,Baysal:2015,jiang:2015},
very little research has been conducted to assist pull-request management.
This highlights the need for an automated tool
that can detect duplicate pull-requests at an early stage.
First,
automatic detection of duplicates can assist reviewers
and spare them redundant work.
Second,
detecting duplicates early can put the authors of the separate pull-requests in touch promptly,
so that they can coordinate and work together on one pull-request rather than
duplicating each other's work independently.
Moreover,
an interest group can be built among developers who submit duplicate contributions,
since they are likely to share the same focus and
care about the same modules or features of a project.


% benefit
% ！！！！
% reduce reviewers' workload
% coordinate developers better
% build `interest groups'
% ！！！！

% !!!!!! this paper contains data on software maintenance cost
% An Analytics-Driven Approach to Identify Duplicate Bug Records in Large Data Repositories
% !!!!!! this paper contains data on software maintenance cost

% [Comparing a new report to already existing reports, in hope of finding a duplicate, is a tedious and error prone process.
% The current search engine in the DMS is a basic string matcher to which you can pass additional arguments such as time interval and DR id interval
% ]

% solution
In this paper,
we propose an approach that uses natural language text and diff information
to automatically detect duplicate pull-requests in GitHub.
The natural language text consists of the title and description of a pull-request,
and the diff information describes the file changes that the pull-request applies.
When a new pull-request arrives,
our method computes the textual similarity
and the diff similarity between it and each existing pull-request,
and then returns a candidate list of the most similar ones.
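As a purely illustrative sketch (an assumption made here for exposition, not the definition used by our approach, which is given in Section~\ref{sec:approach}),
the two signals could be merged into a single ranking score by a weighted sum,
where $p$ is the new pull-request, $q$ is an existing one,
and the weight $\alpha \in [0,1]$ is a hypothetical parameter:
\[
  \mathit{score}(p, q) = \alpha \cdot \mathit{sim}_{\mathit{text}}(p, q) + (1 - \alpha) \cdot \mathit{sim}_{\mathit{diff}}(p, q).
\]
Under this assumption, the candidate list would contain the $k$ existing pull-requests with the highest scores.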
% results
We evaluate our approach in terms of recall-rate
on a test dataset of duplicate pull-requests
that we collected from three popular projects hosted in GitHub,
namely Rails, Elasticsearch and Angular.JS.
%! should we mention the value of k?
The evaluation result shows that about **\% - **\% of the duplicates can be found
when we combine textual similarity and line-level diff similarity,
compared to **\% - **\% using only natural language text
and **\% - **\% using only diff information.
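For reference, assuming the definition of recall-rate that is customary in duplicate bug report detection
(the symbols $k$, $N_{\mathit{detected}}$ and $N_{\mathit{total}}$ are notation introduced only for this illustration),
the metric for a candidate list of size $k$ can be written as
\[
  \mbox{recall-rate@}k = \frac{N_{\mathit{detected}}}{N_{\mathit{total}}},
\]
where $N_{\mathit{total}}$ is the number of duplicate pull-requests in the test dataset
and $N_{\mathit{detected}}$ is the number of those whose counterpart appears among the top-$k$ candidates.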

To the best of our knowledge, we are the first to investigate
how to automatically detect duplicate pull-requests in GitHub.
The key contributions of this study include the following:

\begin{itemize}
	\item
	We formulate the problem of detecting duplicate pull-requests in GitHub
	and construct a dataset of duplicate pull-requests
	by automatic identification and manual examination.

	\item
	We propose an approach using natural language text and diff information
	contained in code changes to detect duplicate pull-requests.

	\item
	The evaluation, based on ** popular projects hosted in GitHub,
	shows that our method can effectively detect **\% - **\% of the duplicates,
	which suggests that it has practical value and can be integrated into issue tracking systems.

\end{itemize}

The rest of the paper is organized as follows:
Section~\ref{sec:bg} introduces the background.
Section~\ref{sec:approach} presents our approach in detail, and
Section~\ref{sec:experiment} describes the experiments and reports the results.
% Section~\ref{sec:Pattern} depicts a preliminary but typical analysis results mined from review comments.
Threats to validity and related work are discussed in Section~\ref{sec:Threats} and Section~\ref{sec:RelatedW}.
Finally, we draw our conclusions in Section~\ref{sec:Concl}.

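% summary of the problems caused
% types of duplicates and their problems
% 	2.1 B is identified as a duplicate of A right after it is submitted
%		the submitter wasted contribution time; the reviewer wasted time on judging
% 	2.2 B is identified as a duplicate of A during review, after it is submitted
%		the submitter wasted contribution time; the reviewer wasted review time
% 	2.3 A is under review after submission; B is submitted, reviewed, and accepted. Finally A is found to be a duplicate of B
%		the submitter wasted contribution time; the reviewer wasted review time; the author of A is very likely to be deeply hurt
%		and to doubt the competence of the core team
	% for reviewers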