%!TEX root = main.tex
\section{Introduction}
% The PR-based contribution workflow on GitHub
On GitHub,
the pull-based mechanism~\cite{GZ14,Gousios:2014,yu2015wait}
lowers the entry barrier for community developers and
promotes the development and evolution of numerous open source software projects.
Any contributor can fork (\ie clone) a repository
and edit the forked repository locally without disturbing the original repository.
After finishing their local work (\eg fixing bugs or proposing features),
contributors package the code changes into a new Pull-Request (PR)
and submit it to the original repository.
Then, the core members of the project and community users
launch a code review process~\cite{Tsay:2014a,Gousios:2016} to
detect potential defects in the submitted PR and discuss how to improve its quality.
Finally, a PR that has gone through several rounds of rigorous evaluation
is merged or rejected by an integrator of the original repository, depending on its eventual quality.
% The emergence of duplicate PRs and their harms
However, due to the parallel and distributed nature of the pull-based development model,
more than one contributor may submit PRs to achieve
a similar objective (\ie duplicate PRs~\cite{Gousios:2014b}).
This is especially true for popular projects that attract thousands of volunteers and
continuously receive incoming PRs~\cite{Yu:2015,Thongtanunam:2015}:
it is hard to appropriately coordinate contributors' activities,
because most of them work in a distributed manner and
tend to lack awareness of each other's progress.
% Harm: maintenance cost for the platform
Duplicate PRs increase the maintenance cost of GitHub projects
and waste the time spent on
the redundant effort of evaluating each of them separately~\cite{Gousios:2014,Gousios:2016}.
Moreover, contributors may iteratively update and improve their PRs
over several rounds of code review~\cite{Yu:2015},
driven by the feedback provided by reviewers.
Therefore, the later the duplicate relation between PRs is identified,
the more effort of contributors and reviewers may be wasted.
% Harm: contributors
Furthermore, improper management of duplicates may also leave contributors
frustrated~\cite{Huang2016Effectiveness} and doubtful about the core team.
Although several studies have analyzed
the popularity~\cite{Gousios:2014}, challenges~\cite{Gousios:2014b,Gousios:2016}, and
evaluation~\cite{Yu:2015,yu2015wait} of PRs,
the problem of duplicate PRs remains not well studied.
More research,
including empirical studies on the causes, outcomes, challenges, and influencing factors of duplicate PRs, and
the development of automatic tools that help reviewers detect and choose among duplicates,
needs to be conducted to better understand and solve the issues introduced by duplicate PRs.
To facilitate such studies,
we constructed a large dataset of historical duplicate PRs
(called \textit{DupPR})
extracted from 26 open source projects on GitHub.
Each pair of duplicate PRs in \textit{DupPR}
has been manually verified after an automatic identification process,
which guarantees the quality of this dataset.
We make the dataset and the source code available online\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
in the hope that it will foster more interest in the following studies.
\begin{itemize}
\item Analyzing how much redundant effort is wasted by duplicate PRs.
This would give researchers a straightforward impression
of how duplicate PRs negatively affect the software development process.
\item Investigating how reviewers make decisions among similar contributions.
Such insights are necessary for building automatic tools that make more targeted comparisons between PRs
and assist reviewers in managing duplicates.
\item Training and evaluating intelligent models for detecting duplicate PRs.
Detecting duplicates at submission time can avoid redundant effort being spent on quality evaluation.
\item Exploring the factors that affect the probability of duplicate PRs occurring.
This makes it possible to recognize inefficient collaboration patterns
that are more likely to generate duplicate contributions,
so that core members can propose corresponding strategies to avoid them.
\end{itemize}
\begin{figure*}[!htbp]
\centering
\includegraphics[width=0.9\textwidth]{figs/data_collection.pdf}
\caption{Process of collecting duplicate pull-requests}
\label{fig:data_get}
\end{figure*}
% The remainder of this paper is organized as follows.
% First, we explain the methodology to construct our dataset.
% Next, we present a description of the dataset.
% Then, we demonstrate some preliminary applications of the dataset.
% Finally, we conclude the paper with the limitations of the dataset and the future work.
% Research motivation ...
% 1. Usage level: understanding why duplicate contributions arise
% 2. Cost: maintenance/management cost
% 3. Much related work exists, but a deeper, further-processed dataset is needed