
\section{Introduction}
\label{sec:intro}
The success of many community-based Open Source Software (OSS) projects
relies heavily on a large number of volunteer developers~\cite{Nakakoji2002Evolution, Crowston2006Core, Steinmacher2016, stol2014two},
who are geographically distributed and collaborate online with others from all over the world~\cite{Gutwin2004Group, Lin2017Developer}.
Compared to the traditional email-based contribution submission~\cite{bird2007dt},
the pull-based model~\cite{Gousios:2014} on modern collaborative coding platforms (\eg GitHub~\cite{github} and GitLab~\cite{gitlab}) supports a more efficient collaboration process~\cite{Zhu2016Effe},
by coupling the code repository with issue tracking, review discussion, and continuous integration, delivery, and deployment~\cite{shahin2017ct, zhang2018one}.
Consequently, an increasing number of OSS projects are adopting the integrated pull-based mechanism,
which helps them improve their productivity~\cite{vas15qua} and attract more contributors~\cite{yu16det}.

However,
while the increased number of contributors in large-scale software development leads to more innovations (\eg unique ideas and inspiring solutions),
it also results in severe coordination challenges~\cite{west2006cg}.
Currently,
one of the typical coordination problems in pull-based development
is duplicate work~\cite{zhou2019fork, steinmacher2018almost},
due to the asynchronous nature of loosely self-organized
collaboration~\cite{Steinmacher2016,bird2008latent} in OSS communities.
On the one hand,
it is unreasonable for a core team to arrange and assign
external contributors to carry out every specific task
under the open source model~\cite{lakhani2004open,lakhani2003hackers}
(\ie external contributors are mainly motivated by interest and intellectual
stimulation derived from writing code,
rather than requirements or assignments).
On the other hand,
it is impractical to expect external developers
(especially newcomers and occasional contributors)
to deeply understand the development progress of the OSS projects~\cite{lee2017understanding,steinmacher2015social,gousios2016work}
before submitting patches.
Thus, OSS developers involved in the pull-based model submit \textit{duplicate pull requests} (akin to duplicate bug reports~\cite{bg2008dp}), even though they collaborate on modern social coding platforms (\eg GitHub)
with relatively transparent~\cite{Tsa14Let, Dabbish2012Social} and
centralized~\cite{Gousios:2014} working environments.
{\color{hltext}
The recent study by Zhou \etal~\cite{zhou2019fork} has shown that full or partial duplication is pervasive in OSS projects and particularly severe in some large projects (up to 51\% duplicates).
}

Notably, a large proportion of duplicates are not submitted intentionally
to provide different or better solutions.
Instead, contributors submit duplicates unintentionally
because of misinformation and unawareness of a project's status~\cite{gousios2016work,zhou2019fork}.
%because their submitters failed to build an understanding of the project status~\cite{gousios2016work,zhou2019fork}
%%% fixed %%% the reasons provided here come from nowhere, particularly "developers inappropriate work patterns"
In practice, duplicate pull requests may cause substantial friction
among external contributors and core integrators;
these duplicates are a common reason
for direct rejection~\cite{gousios2016work, steinmacher2018almost}
without any chance for improvement,
which frustrates contributors and discourages them from contributing further.
Moreover, compared with traditional development models,
redundant work is more likely to increase costs
in the evaluation and maintenance stages, which are integrated with DevOps tools.
For example,
continuous integration tools (\eg Travis-CI~\cite{vas15qua, widder2019conceptual}) automatically merge every newly received pull request into a testing branch, build the project, and run the existing test suites,
so computing resources are wasted if integrators do not discover the duplicates and stop the automation process in time.
{\color{hltext}
Therefore,
%both project integrators and contributors hope
avoiding duplicate pull requests is becoming a practical need in OSS management.
For example, scikit-learn provides a special note in its contributing guide:
\textit{``To avoid duplicating work, it is highly advised that you search through the issue tracker and the PR list. If in doubt about duplicated work, or if you want to work on a non-trivial feature, it’s recommended to first open an issue in the issue tracker to get some feedbacks from core developers.''}~\cite{intor-avoid}.

%and \textit{``Oops! Sorry, did not mean to double up''}~\cite{cntor-avoid}.
}


Existing work has highlighted the problems of duplicate pull requests~\cite{Gousios:2014,steinmacher2018almost,zhou2019fork} (\eg inefficiency and redundant development), and
proposed ways to detect duplicates~\cite{li2017detecting,ren2019identifying}.
However, the nature of duplicate pull requests, particularly the fine-grained resources that are wasted by the duplicates, the context in which duplicates occur,
and the features that distinguish merged duplicates from their counterparts,
has rarely been investigated.
Answering these questions would help mitigate the threats posed by duplicate pull requests
and improve software productivity.

Therefore, this study aims to bridge this gap in the investigation of duplicate pull requests. %in this paper, we report an empirical study on duplicate pull requests.
We extend our previously collected duplicate pull request dataset~\cite{dup2018} by adding change details, review history, and integrators' choices.
Based on the dataset,
we analyze the redundancies of duplicate pull requests in the development and evaluation stages,
explore the context in which duplicates occur,
and examine the differences between duplicate and non-duplicate pull requests.
We further investigate why, among a group of duplicates,
one pull request is more likely than the others to be accepted by an integrator.
Finally, we propose actionable suggestions for OSS communities.

The main contributions of this paper are summarized as follows:
\begin{itemize}
\item It presents empirical evidence on the impact of duplicate pull requests on development effort and the review process.
The findings will help software engineering researchers and practitioners better understand the threats of duplicate pull requests.
%(RQ1)

\item It reveals the context of duplicate pull requests, highlighting the inappropriateness of OSS contributors' work patterns and the shortcomings of the current OSS collaboration environment.
These findings can guide developers to avoid redundant effort on the same task. %(RQ2-1)

\item It provides quantitative insights into the differences between duplicate and non-duplicate pull requests,
which can offer useful guidance for automatic duplicate detection.
%(RQ2-2)

\item It summarizes the characteristics of the accepted pull requests compared to those of their duplicate counterparts,
which provides actionable suggestions to inexperienced integrators for choosing among duplicates.
%(RQ3)
\end{itemize}

The rest of the paper is organized as follows:
Section~\ref{sec:bg} introduces the background and research questions.
Section~\ref{sec:dsrq} presents the dataset used in this study.
Sections~\ref{rq1}, \ref{rq2} and \ref{rq3} report the experimental results and findings.
Section~\ref{sec:ds} proposes actionable suggestions and implications for OSS practitioners.
Section~\ref{sec:lm} discusses the threats to the validity of the study.
Finally, we draw conclusions in Section~\ref{sec:cc}.