%!TEX root = main.tex

\section{Introduction}

% The PR-based contribution workflow on GitHub
In GitHub,
the pull-based mechanism~\cite{GZ14,Gousios:2014,yu2015wait}
lowers the barrier to entry for community developers and
promotes the development and evolution of numerous open source software projects.
Any contributor can fork (\ie clone) a repository
and edit the forked repository locally without disturbing the original repository.
After finishing their local work (\eg fixing bugs or proposing features),
contributors package the code changes into a new Pull-Request (PR)
and submit it to the original repository.
The core members of the project and community users then
launch a code review process~\cite{Tsay:2014a,Gousios:2016} to
detect potential defects in the submitted PR and discuss how to improve its quality.
Finally, after several rounds of rigorous evaluation,
the PR is merged or rejected by an integrator of the original repository, depending on its eventual quality.

% How duplicate PRs arise and the harm they cause
However, due to the parallel and distributed nature of the pull-based development model,
more than one contributor may submit PRs to achieve
a similar objective (\ie duplicate PRs~\cite{Gousios:2014b}).
%duplicate PRs~\cite{Gousios:2014b} may be submitted by more than one contributor
%to achieve the same objective.
Especially for popular projects that attract thousands of volunteers and
continuously receive incoming PRs~\cite{Yu:2015,Thongtanunam:2015},
it is hard to coordinate contributors' activities appropriately,
because most of them work in a distributed manner and
tend to lack information about others' progress.
% Harm: platform
Duplicate PRs increase the maintenance cost of GitHub
and waste time on
the redundant effort of evaluating each of them separately~\cite{Gousios:2014,Gousios:2016}.
Moreover, contributors may iteratively update and improve their PRs
over several rounds of code review~\cite{Yu:2015},
driven by the feedback provided by reviewers.
Therefore, the later the duplicate relation between PRs is identified,
the more effort of contributors and reviewers may be wasted.
% Harm: contributors
Furthermore, improper management of duplicates may
frustrate contributors~\cite{Huang2016Effectiveness} and make them doubtful about the core team.

% late identification may also lead the contributor to be more frustrated~\cite{Huang2016Effectiveness}
% and get doubtful about the core team when s/he have paid plenty of effort
% in several rounds of contribution improvements and code reviews,
% especially when her/his PR is finally rejected
% while a duplicate PR submitted late is preferred.

Although several studies have analyzed
the popularity~\cite{Gousios:2014}, challenges~\cite{Gousios:2014b,Gousios:2016}, and
evaluation~\cite{Yu:2015,yu2015wait} of PRs,
the problem of duplicate PRs has not been well studied.
More research,
including empirical studies on the causes, outcomes, challenges, and influencing factors of duplicate PRs and
the development of automatic tools that help reviewers detect and choose among duplicates,
needs to be conducted to better understand and solve the issues introduced by duplicate PRs.
To facilitate further studies,
we constructed a large dataset of historical duplicate PRs
(called \textit{DupPR})
extracted from 26 open source projects on GitHub.
Each pair of duplicate PRs in \textit{DupPR}
has been manually verified after an automatic identification process,
which ensures the quality of this dataset.
%The dataset and the source code used to recreate it is available
%online.~\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
%Based on this dataset, the following interesting research can be feasible.
We make the dataset and the source code available online\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
in the hope that it will foster more interest in the following studies.
%which enables the researchers

\begin{itemize}
% \item Analyzing how much redundant effort would be wasted by duplicate PRs.
% This would give researchers a clear idea of the issues that duplicate PRs have introduced.
\item Analyzing how much redundant effort is wasted on duplicate PRs.
This would give researchers a clear picture
of how duplicate PRs negatively affect the software development process.

\item Investigating how reviewers make decisions among similar contributions.
It is necessary to build automatic tools that make more targeted comparisons between PRs
and assist reviewers in managing duplicates.

\item Training and evaluating intelligent models for detecting duplicate PRs.
Detecting duplicates at submission time can avoid redundant effort spent on quality evaluation.

\item Exploring the factors that affect the probability that duplicate PRs occur.
%Exploring what kind of contributors are more likely to submit duplicate PRs.
This makes it possible to recognize inefficient collaborative patterns
that are more likely to generate duplicate contributions,
so that core members can propose corresponding strategies to avoid them.

\end{itemize}

\begin{figure*}[!htbp]
\centering
\includegraphics[width=0.9\textwidth]{figs/data_collection.pdf}
\caption{Process of collecting duplicate Pull-Requests}
\label{fig:data_get}
\end{figure*}

% The remainder of this paper is organized as follows.
% First, we explain the methodology to construct our dataset.
% Next, we present a description of the dataset.
% Then, we demonstrate some preliminary applications of the dataset.
% Finally, we conclude the paper with the limitations of the dataset and the future work.

% Research motivation ...
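% 1. At the usage level, it helps explain why duplicate contributions arise
% 2. Cost: management cost
% 3. There is much related work, but a deeper dataset with additional in-depth processing is needed,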