\section{Context where duplicate pull requests are produced}
\label{rq2}
Prior work has found the paradoxical phenomenon that
despite the effort devoted by contributors to avoid similar work,
many duplicate pull requests are submitted~\cite{gousios2016work}.
Thus, we further investigate the context in which duplicates occur and the factors that lead pull requests to become duplicates.

First,
we investigate the context of pull requests when duplicates occur, as described in Section~\ref{ss:contextanalysis}.
In particular, we examine the lifecycle of pull requests and
discover three types of sequential relationships
between two duplicate pull requests.
For each relationship,
we investigate whether contributors' work patterns and their collaborating environment
have flaws that may produce duplicate pull requests.

Second,
we investigate the differences between duplicate and non-duplicate pull requests, as described in Section~\ref{ss:differenceanalysis}.
We identify a set of metrics from prior studies to characterize pull requests.
We then conduct a comparative exploration and a regression analysis to
examine the characteristics that distinguish duplicate from non-duplicate pull requests.

\subsection{The context of duplicate pull requests}\label{ss:contextanalysis}
The entire lifecycle of a pull request consists of two stages: \textit{local creation} and \textit{online evaluation}.
In the local creation stage,
contributors edit files and commit changes to their local repositories.
In the online evaluation stage,
contributors submit a pull request to ask the integrators of the original repository
to review the committed changes online.
These two stages are separated by the submission time of the pull request.
For each pair of pull requests,
only three types of sequential relationships are logically possible when comparing the order in which they enter each stage.
We manually analyze contributors' work patterns and the collaborating environment to explore the possible contexts of duplicates under the different relationships.
In the following sections,
we first elaborate on the three types of sequential relationships and then present the identified contexts of duplicate pull requests, demonstrated by statistics and representative cases.

\subsubsection{Types of sequential relationships}
\mathchardef\mhyphen="2D
We first introduce two critical time points, $T\mhyphen Creation$ and $T\mhyphen Evaluation$, in the lifecycle of pull requests;
these time points are defined as follows.

\begin{itemize}
\item
$T\mhyphen Creation$ indicates the start of pull request local creation.
It is impossible to know the exact time at which a developer begins to work,
since developers run \texttt{git commit} only when a piece of work is finished
rather than when it is started.
However, we can still obtain an approximate start time.
We set $T\mhyphen Creation$ to the \texttt{author\_date} of the first commit packaged in a pull request,
which is the earliest timestamp contained in the commit history of the pull request.

\item
$T\mhyphen Evaluation$
indicates the start of pull request online evaluation, \ie the submission time of a pull request.
This value is the \texttt{created\_at} value of a pull request.

\end{itemize}

For a pair of duplicate pull requests $<\!mst\_pr, dup\_pr\!>$ ($mst\_pr$ is submitted earlier than $dup\_pr$),
we suppose that the contributor of $mst\_pr$ begins to work at $T\mhyphen Creation_{mst}$
and submits $mst\_pr$ at $T\mhyphen Evaluation_{mst}$,
and the contributor of $dup\_pr$ begins to work at $T\mhyphen Creation_{dup}$ and submits $dup\_pr$ at $T\mhyphen Evaluation_{dup}$.
We discover three possible sequential relationships between $mst\_pr$ and $dup\_pr$, as shown in Figure~\ref{fig:temp_rela} and formalized in the sketch that follows the list below.

\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{resources/temp_rela.pdf}
\caption{The sequential relationship between two pull requests $mst\_pr$ and $dup\_pr$.}
\label{fig:temp_rela}
\end{figure}

\begin{itemize}
\item \textit{Exclusive}. $T\mhyphen Creation_{mst} < T\mhyphen Evaluation_{mst} < T\mhyphen Creation_{dup} < T\mhyphen Evaluation_{dup}$,
\ie the author of $dup\_pr$ begins to work after the author of $mst\_pr$ has already finished
the local work and submitted the pull request.
\item \textit{Overlapping}. $T\mhyphen Creation_{mst} < T\mhyphen Creation_{dup} \leq T\mhyphen Evaluation_{mst} < T\mhyphen Evaluation_{dup}$,
\ie the author of $dup\_pr$ starts working after the author of $mst\_pr$ starts working and before the author of $mst\_pr$ finishes working.
\item \textit{Inclusive}. $T\mhyphen Creation_{dup} \leq T\mhyphen Creation_{mst} < T\mhyphen Evaluation_{mst} < T\mhyphen Evaluation_{dup}$,
\ie although the author of $dup\_pr$ starts to work earlier than the author of $mst\_pr$ does,
s/he submits the pull request later.

\end{itemize}
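
To make these definitions concrete, the following is a minimal sketch of how a pair of duplicates could be classified once the four timestamps are known; it assumes the timestamps have already been extracted as described above (\ie $T\mhyphen Creation$ from the \texttt{author\_date} of the first commit and $T\mhyphen Evaluation$ from \texttt{created\_at}), and the function name and example values are illustrative only.

{\footnotesize
\begin{verbatim}
from datetime import datetime

def classify_pair(t_creation_mst, t_evaluation_mst,
                  t_creation_dup, t_evaluation_dup):
    """Classify a duplicate pair (mst_pr, dup_pr);
    assumes mst_pr was submitted first, i.e.
    t_evaluation_mst < t_evaluation_dup."""
    if t_creation_dup > t_evaluation_mst:
        # dup_pr's local work started after mst_pr
        # had already been submitted.
        return "exclusive"
    elif t_creation_dup > t_creation_mst:
        # dup_pr's local work started while mst_pr
        # was still being created locally.
        return "overlapping"
    else:
        # dup_pr's local work started first,
        # but it was submitted later.
        return "inclusive"

# Hypothetical timestamps:
print(classify_pair(datetime(2020, 1, 1),
                    datetime(2020, 1, 5),
                    datetime(2020, 1, 7),
                    datetime(2020, 1, 9)))
# -> exclusive
\end{verbatim}
}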

In the above,
we only discuss the common cases in which
developers first commit changed code and then issue a pull request.
However,
some developers may first issue an `empty' pull request to obtain early feedback
and then start to work and push commits to the pull request.
The two different workflows can be called
the \textit{commit-then-pullrequest} (\textit{CTP}) model and the \textit{pullrequest-then-commit} (\textit{PTC}) model,
respectively.
Indeed,
the \textit{PTC} model is encouraged in some projects for introducing new features,
as it can prevent contributors from wasting time on undesired features.

Because developers adopting the \textit{PTC} model
only report their ideas rather than submitting any concrete code when they submit their pull requests,
their pull requests skip the local creation stage and enter the review stage directly.
The above definition of a sequential relationship therefore does not apply to \textit{PTC} pull requests.
To accommodate this situation,
we define $T\mhyphen Exposure$ to indicate the exposure time of a developer's idea for a \textit{PTC} pull request.
$T\mhyphen Exposure$ is also set to the \texttt{created\_at} of the pull request.
Next,
we discuss the sequential relationship between two pull requests
of which at least one is a \textit{PTC} pull request.

\begin{table}[h]
\centering
\caption{Three cases involving the \textit{pullrequest-then-commit} pull requests}
\begin{tabularx}{0.49\textwidth}{@{}X Y Y @{}}
\toprule
\textbf{Case} &\textbf{\textit{mst\_pr}} &\textbf{\textit{dup\_pr}} \\ \midrule
1&\textit{CTP} & \textit{PTC}\\
2&\textit{PTC} & \textit{PTC}\\
3&\textit{PTC} & \textit{CTP}\\
\bottomrule
\end{tabularx}
\label{tab:model_relation}
\end{table}

As shown in Table~\ref{tab:model_relation},
there are three cases involving \textit{PTC} pull requests.
In \textit{case 1} ($T\mhyphen Creation_{mst} < T\mhyphen Evaluation_{mst} < T\mhyphen Exposure_{dup}$)
and \textit{case 2} ($T\mhyphen Exposure_{mst} < T\mhyphen Exposure_{dup}$),
the local work or the idea of $mst\_pr$ has been exposed to the community
before $dup\_pr$ is submitted.
Therefore,
we treat the sequential relationship between a pair of pull requests in these two cases
as exclusive.
In \textit{case 3},
there are two possible situations:
a) $T\mhyphen Exposure_{mst} < T\mhyphen Creation_{dup} < T\mhyphen Evaluation_{dup}$ means that the idea of $mst\_pr$ has been exposed before
the author of $dup\_pr$ starts to work,
and their sequential relationship can be seen as exclusive;
b) $T\mhyphen Creation_{dup} < T\mhyphen Exposure_{mst} < T\mhyphen Evaluation_{dup}$ means that the author of $dup\_pr$ starts to work before
the idea of $mst\_pr$ is exposed and finishes the work afterwards,
so their sequential relationship can be seen as inclusive.

Finally,
to explore the distribution of the three types of relationships in our dataset,
we convert each tuple of duplicate pull requests ($<\!dup_1,\ dup_2,\ dup_3,\ ..., dup_n\!>$)
into pairs ($dup_i$, $dup_j$),
where $1\leq i<j \leq n$.
Table~\ref{tab:temporal_seq} shows the distribution.
We can see that the majority of duplicate pull request pairs have an exclusive sequential relationship (\ie the author of the duplicate begins to work after the original pull request has become visible), which suggests that maintaining awareness and transparency~\cite{Dabbish2012Social} during the collaboration process remains a great challenge.
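
The tuple-to-pair conversion can be sketched as follows (a minimal example over index combinations; the variable names are illustrative):

{\footnotesize
\begin{verbatim}
from itertools import combinations

# A tuple of n duplicates yields n*(n-1)/2
# pairs (dup_i, dup_j) with i < j.
def to_pairs(dup_tuple):
    return list(combinations(dup_tuple, 2))

print(to_pairs(["dup_1", "dup_2", "dup_3"]))
# [('dup_1', 'dup_2'), ('dup_1', 'dup_3'),
#  ('dup_2', 'dup_3')]
\end{verbatim}
}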

\begin{table}[h]
{\color{hltext}
\centering
\caption{{\color{hltext}The distribution of different sequential relationships in the dataset}}
\begin{tabularx}{0.49\textwidth}{@{\hspace{0em}}X X X X@{\hspace{0em}}}
\toprule
& \textbf{Exclusive} & \textbf{Overlapping} &\textbf{Inclusive} \\ \midrule
Count & 1,924 & 17 & 81 \\
\bottomrule
\end{tabularx}
\label{tab:temporal_seq}
}
\end{table}

\subsubsection{Context of exclusive duplicate pull requests}

\vspace{0.5em}
\noindent \textbf{Not searching for existing work.}
For a pair of exclusive duplicate pull requests,
there is a time window during which the author of $dup\_pr$ had a chance
to discover the existence of $mst\_pr$.
However,
some developers did not check whether someone
had already submitted a similar pull request.
{\color{hltext}
For example,
quite a few contributors said that they did not search the pull request list for similar work (\eg \textit{``Oh, Sorry I did not search for a previous PR before submitting a PR''} and \textit{``Ah should have searched first, thanks''}).
In some cases,
developers' search was incomplete because they only searched the open pull requests and missed the closed ones
(\eg \textit{``Ah, my bad. I thought I searched, but I must have only been looking at open''}).
}
The survey conducted by Gousios \etal~\cite{gousios2016work} also showed that
45\% of contributors occasionally or never check whether similar pull requests already exist before coding.

\vspace{0.5em}
\noindent \textbf{Diversity of natural language usage.}
Some developers tried to search for existing duplicates,
but they ultimately found nothing
(\eg \textit{``Sorry, I searched before pushing but did not find your PR...''}).
One challenge is the diversity of natural language usage.
For a pair of duplicate pull requests,
we compute the common words ratio ($CWR$) of their titles,
which is calculated by the following formula.

\begin{equation}
CWR(mst\_pr, dup\_pr) = \frac{| WS_{mst\_pr} \cap WS_{dup\_pr}|} {|WS_{mst\_pr}|}
\end{equation}

$WS_{mst\_pr}$ and $WS_{dup\_pr}$ represent the sets of words extracted from
the titles of $mst\_pr$ and $dup\_pr$, respectively,
after necessary preprocessing such as tokenizing, stemming~\cite{miller1995wordnet},
and removing common stop words.
Figure~\ref{fig:cwr} shows the statistics of the common words ratio.
Approximately half of the pairs have a value of less than 0.25,
which means that a pair of duplicates tends to share only a small proportion of common words.
That is to say,
a keyword-based query cannot always
successfully detect
existing duplicate pull requests
due to differences in wording for the same concept.
For example,
the title of \textit{angular/angular.js/\#4916} is
\textit{``Fixed step12 correct file reference''}
and the title of \textit{angular/angular.js/\#4860} is
\textit{``Changed from phone-list.html to index.html''}.
We can see that the two titles share no common word,
although the two pull requests edited the same file and changed the same line of code.
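
As a concrete illustration, a minimal sketch of how $CWR$ could be computed with off-the-shelf NLP tooling (here NLTK) is shown below; the exact tokenizer, lemmatizer, and stop word list in our pipeline may differ, so the helper names and preprocessing choices are illustrative.

{\footnotesize
\begin{verbatim}
# pip install nltk
# python -m nltk.downloader punkt stopwords wordnet
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
LEMMA = WordNetLemmatizer()

def word_set(title):
    """Tokenize a title, drop stop words,
    and normalize word forms."""
    tokens = nltk.word_tokenize(title.lower())
    words = [t for t in tokens
             if re.match(r"^[a-z][a-z0-9_.-]*$", t)
             and t not in STOP]
    return {LEMMA.lemmatize(w) for w in words}

def cwr(mst_title, dup_title):
    """|WS_mst intersect WS_dup| / |WS_mst|"""
    ws_mst = word_set(mst_title)
    ws_dup = word_set(dup_title)
    if not ws_mst:
        return 0.0
    return len(ws_mst & ws_dup) / len(ws_mst)

print(cwr("Fixed step12 correct file reference",
          "Changed from phone-list.html to index.html"))
# -> 0.0 (no shared words despite the same change)
\end{verbatim}
}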

\begin{figure}[h]
\centering
\includegraphics[width=0.45\textwidth]{resources/cwr.png}
\caption{The statistics of the common words ratio between duplicates.}
\label{fig:cwr}
\end{figure}

\vspace{0.5em}
\noindent \textbf{Disappointing search functionality in GitHub.}
Another challenge that can cause ineffective searching for duplicates is that
GitHub's search functionality might be disappointing at
retrieving similar pull requests even though they share some common words.
For example,
the titles of \textit{angular/angular.js/\#5063} and \textit{angular/angular.js/\#7846} are
\textit{``fix(copy): preserve prototype chain when copying object''} and \textit{``Use source object prototype in object copy''}, respectively,
which share three critical words, \ie \textit{prototype, copy,} and \textit{object}.
For testing purposes,
we launch a query in GitHub using the keywords \textit{prototype copy object}.
We retrieve 9 pages (each page containing 10 items) of issues and pull requests in the search results
and finally find \textit{angular/angular.js/\#5063} on the 7th page.
It is unlikely that
developers have the willingness and patience to
look through 7 pages of search results to discover the existence of duplicates,
since people tend to focus on the first few pages~\cite{KammererThe}.
Perhaps that is exactly what led the author of \textit{angular/angular.js/\#7846} to submit a duplicate,
although he blamed the failed retrieval on himself
(\textit{``apologies, I did search before posting (forget the search term I used) but clearly my search was bad... Thanks for finding the dup''}).
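
The same keyword query can also be reproduced programmatically against the GitHub search API, as in the sketch below; the ranking returned by the API is not necessarily the one shown by the web interface we used, so this is only meant to illustrate the procedure.

{\footnotesize
\begin{verbatim}
import requests

def search_pull_requests(repo, keywords, token=None):
    """Query GitHub's issue/PR search for one repo."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = "Bearer " + token
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": keywords + " repo:" + repo + " is:pr",
                "per_page": 100},
        headers=headers, timeout=30)
    resp.raise_for_status()
    return [(it["number"], it["title"])
            for it in resp.json()["items"]]

hits = search_pull_requests("angular/angular.js",
                            "prototype copy object")
for number, title in hits[:10]:
    print(number, title)
\end{verbatim}
}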

\vspace{0.5em}
\noindent \textbf{Large search space.}
Developers might manually look through the issue tracker to search for duplicates
rather than querying through a search interface.
Sometimes it is hard to find existing duplicates because of the large search space.
The statistics of the exclusive intervals between duplicates are listed in Table~\ref{tab:exc_inter}.
On average,
the local work of $dup\_pr$ starts approximately 1,400 hours (\ie more than 58 days)
after $mst\_pr$ has been submitted.
During that long period,
many new pull requests have been submitted in popular projects.
For example,
307 pull requests were submitted
between
\textit{pandas-dev/pandas/\#9350} and \textit{pandas-dev/pandas/\#10074}.
These pull requests can occupy more than 10 pages in the issue tracker,
which makes it rather hard and ineffective to review historical pull requests page by page, as one developer stated:
\textit{``...This is a dup of that PR. I should have looked harder as I didn't see that one when I created this one...''}.

\begin{table}[h]
{\color{hltext}
\centering
\caption{{\color{hltext}The statistics of exclusive intervals (in hours)}}
\begin{tabularx}{0.49\textwidth}{@{}l@{\hspace{0.5em}}Y@{\hspace{0.2em}}Y@{\hspace{0.2em}}Y Y Y Y@{}}
\toprule
&\textbf{Min} &\textbf{25\%} &\textbf{Median} &\textbf{75\%} &\textbf{Max} &\textbf{Mean} \\ \midrule
Interval & 0.004 & 23.81 & 212.59 & 1276.82 & 29377.12 & 1397.59 \\
\bottomrule
\end{tabularx}
\label{tab:exc_inter}
}
\end{table}

{\color{hltext}

\vspace{0.5em}
\noindent \textbf{Overlooking linked pull requests.}
When developers submit a pull request to solve an existing GitHub issue,
they can build a link between the pull request and the issue by referencing the issue in the pull request description.
The cross-reference is also displayed
in the discussion timeline of the issue.
Links not only allow pull request reviewers to find the issue addressed by a pull request
but also help developers who are concerned with an issue to discover which pull requests have been submitted for that issue.
In some cases,
contributors did not examine or notice the pull requests linked to an issue
(\eg \textit{``Uhm yeah, didn't spot the reference in \#21967''}
and \textit{``Argh, didn't see it in the original issue. Need more coffee I guess''})
to make sure that there was no ongoing work on that issue,
and consequently submitted a duplicate pull request.

\vspace{0.5em}
\noindent \textbf{Lack of links.}
If a developer does not link her/his pull request to the associated issue,
other developers might assume that no patch has been submitted to fix that issue.
As a result,
another interested developer might submit a duplicate pull request.
For example,
a developer \textit{Dev2} submitted a pull request
\textit{facebook/react/\#6135} to address
the issue \textit{facebook/react/\#6114}.
However,
\textit{Dev2} was told that
a duplicate pull request \textit{facebook/react/\#6121} had already been submitted by \textit{Dev1} before him.
The conversation between \textit{Dev1} and \textit{Dev2}
(\textit{Dev2} said \textit{``I'm glad to hear that. But please link your future PRs to the issues''}, and \textit{Dev1} replied \textit{``Yeah I will that's on me!''})
revealed that the missing link accounts for the duplication.

\vspace{0.5em}
\noindent \textbf{Missing notifications.}
If developers have watched~\cite{Dabbish2012Social,Sheoran2014Understanding} a project,
they receive notifications about events that occur in the project,
\eg new commits and pull requests.
The notifications are displayed on developers' GitHub dashboards and,
if configured, are also sent to developers via email.
However,
developers might miss the notification about the submission of a similar pull request due to information overload~\cite{Dabbish2013Leveraging} or other reasons,
and eventually submit a duplicate.
For example,
\textit{kubernetes/kubernetes/\#43902}
was a duplicate of
\textit{kubernetes/kubernetes/\#43871}
because the author of \textit{kubernetes/kubernetes/\#43902} \textit{``missed the mail for the PR it seems :-/''}.

}

\subsubsection{Context of overlapping and inclusive duplicates}

\vspace{0.5em}
\noindent \textbf{Unawareness of parallel work.}
{\color{hltext}
Developers who encounter a problem might prefer to fix the problem by
themselves and submit a pull request,
instead of reporting the problem in the issue tracker and waiting for a fix.
When two developers encounter the same problem at around the same time,
regardless of which developer is the first to work on the problem,
the other developer might also start to work on it before the first developer submits a pull request.
In such cases,
both developers are unaware of each other's concurrent activities,
because their local work is conducted offline and is not publicly visible.
For example,
the authors of \textit{emberjs/ember.js/\#4214} and \textit{emberjs/ember.js/\#4223}
individually fixed the same typos in parallel without being aware of each other,
and finally submitted two duplicate pull requests.

}

{\color{hltext}
\vspace{0.5em}
\noindent \textbf{Implementing without claiming first.}
Sometimes,
developers directly start to implement a patch for a GitHub issue
without claiming the issue first (\ie leaving a comment like \textit{``I'm on it''}).
This introduces the risk that
other interested developers might also start to work on the same issue,
unaware that there is actually already a developer working on that issue.
For example,
although two developers were both trying to solve the issue
\textit{facebook/react/\#3948},
neither of them claimed the issue before coding their patch.
Finally, they submitted two duplicate pull requests,
\textit{facebook/react/\#3949} and \textit{facebook/react/\#3950}.
The phenomenon that developers are not used to claiming issues was also reported in previous research~\cite{zhou2019fork}.

\vspace{0.5em}
\noindent \textbf{Overlooking existing claims.}
For a GitHub issue
that has already been claimed by a developer,
if other developers do not check or notice the existing claim comment in the issue discussion,
they may prepare duplicate patches.
For example,
a developer \textit{Dev1} first claimed the issue \textit{scikit-learn/scikit-learn/\#8503} by
leaving a comment \textit{``We are working on this''}.
However,
another developer \textit{Dev2} did not notice this claim,
as she explained: \textit{``Ah~, nope. I just realized someone was also working on it after I committed''}.
Consequently,
\textit{Dev2} and \textit{Dev1} conducted duplicate development in parallel
and submitted two duplicate pull requests,
\textit{scikit-learn/scikit-learn/\#8517} and \textit{scikit-learn/scikit-learn/\#8518}, respectively.

\vspace{0.5em}
\noindent \textbf{Communication failure.}
In some cases,
developers communicate in person or online
about a software problem or a desired feature.
However,
their communication might fail to reach a clear agreement on
who takes the responsibility of submitting a patch.
As a result,
more than one developer might submit a patch, leading to duplicates.
For example,
two duplicate pull requests,
\textit{saltstack/salt/pull/\#17691}
and \textit{saltstack/salt/pull/\#17692}, were
submitted due to the ambiguity in the communication between the two authors before submitting the pull requests (\textit{``The last sentence was the \textit{\footnotesize 'Anyway, I'd ask you to submit it against 2014.7....'} which was evenly in a separate comment, see the ambiguity ? i thougth and you had changed your mind, so i just pushed that changeset...''}).
}

\vspace{0.5em}
\noindent \textbf{Overlong local work.}
For overlapping and inclusive duplicate pull request pairs,
we calculate the duration of the local work that was started earlier.
Specifically,
we collect two groups of pull requests:
(a) $OVL$, which includes the $mst\_pr$ of each pair of overlapping duplicates,
and (b) $INC$, which includes the $dup\_pr$ of each pair of inclusive duplicates.
Figure~\ref{fig:dur_cmp} plots the duration statistics of each group together with the group $NON$, which includes all non-duplicate pull requests.
We observe that,
compared with the pull requests in $NON$,
the pull requests in $OVL$ and $INC$ have longer local durations
in terms of the median.
We also find that the difference between $NON$ and $OVL$ is significant
according to {\color{hltext}the $\widetilde{\textbf{T}}$-procedure test results,
as shown in Table~\ref{tab:dur_sig}}.
This reveals that
overlong local work delays the exposure time of the work and
thereby hinders late contributors from realizing in a timely fashion that someone has already done the same work.
As discussed in \cite{gousios2016work}, developers rarely recheck the existence of similar pull requests after they have finished the local work.
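
A minimal sketch of how the local duration is derived, assuming a pandas DataFrame with one row per pull request and hypothetical columns holding the group label and the two timestamps defined in the previous section:

{\footnotesize
\begin{verbatim}
import pandas as pd

prs = pd.DataFrame({
    "group": ["NON", "OVL", "INC"],
    "t_creation": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-02 09:00",
        "2020-01-03 10:00"]),
    "t_evaluation": pd.to_datetime([
        "2020-01-01 12:00", "2020-01-05 09:00",
        "2020-01-04 10:00"]),
})

# Local duration: time between the first commit
# (T-Creation) and the submission (T-Evaluation).
delta = prs["t_evaluation"] - prs["t_creation"]
prs["local_hours"] = delta.dt.total_seconds() / 3600
print(prs.groupby("group")["local_hours"].median())
\end{verbatim}
}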

\begin{figure}[h]
\centering
\includegraphics[width=0.45\textwidth]{resources/dur_cmp.png}
\caption{Duration of local work in each group.}
\label{fig:dur_cmp}
\end{figure}

\begin{table}[h]
{\color{hltext}
\centering
\caption{{\color{hltext}Results of the multiple contrast test procedure}}
\begin{tabularx}{0.49\textwidth}{@{}c@{\hspace{0.5em}}c@{\hspace{0.4em}}c@{\hspace{0.4em}}c@{\hspace{0.4em}}c@{\hspace{0.4em}}l@{}}
\toprule
\textbf{Group A vs. B}& \textbf{Estimator} & \textbf{Lower} & \textbf{Upper} &\textbf{Statistic} &\textbf{p-value}\\
\midrule
\textit{NON - OVL} &-0.232 &-1.000 &-0.219 &-38.822 &0.000 ***\\
\textit{NON - INC} &-0.012 &-1.000 &0.002 &-1.771 &0.098\\
\textit{OVL - INC} &0.220 &-1.000 &0.238 &25.400 &1.000\\
\bottomrule
\multicolumn{6}{l}{\emph{*** p \textless 0.001, ** p \textless 0.01, * p \textless 0.05}}
\end{tabularx}
\label{tab:dur_sig}
}
\end{table}

\begin{mdframed}[backgroundcolor=gray!8]
\noindent
\textit{
\textbf{RQ2-1:}
We identified 12 contexts in which duplicate pull requests occur.
They mainly relate to
developers' behaviors,
e.g., not checking for existing work, not claiming an issue before coding,
not providing links, and overlong local work,
and to their collaborating environment,
e.g., unawareness of parallel work, missing notifications, and a lack of effective tools for checking for duplicates.
Communication failure
can also result in duplicates.
}
\end{mdframed}

\subsection{The difference between duplicate and non-duplicate pull requests}
\label{ss:differenceanalysis}

Although some specific cases could be effectively avoided if developers paid attention to their work patterns,
duplicates are difficult to eradicate completely
given the distributed and spontaneous nature of OSS development.
Therefore,
automatic detection of duplicates is still needed to help reviewers handle duplicates in a timely fashion.
Given that prior studies have mainly used similarity-based methods to detect duplicate pull requests~\cite{li2017detecting,ren2019identifying},
we are interested in exploring the difference between duplicate and non-duplicate pull requests from a comparative perspective,
which could offer useful guidance for optimizing detection performance.
In particular,
we want to identify the distinguishing characteristics of duplicate pull requests that lead them to become duplicates.
{\color{hltext}
First,
we
identify metrics that have been used in prior research,
as described in Section~\ref{ss:metrics}.
Then,
we compare duplicate and non-duplicate pull requests in terms of each metric
and check whether significant differences can be observed between them
through statistical tests,
as described in Section~\ref{ss:compare_explore}.
Furthermore,
in Section~\ref{ss:dup_regression},
we apply a regression analysis to model the correlations
between the collected metrics and pull requests' likelihood of being duplicates.

\subsubsection{Metrics}
\label{ss:metrics}

We conduct a literature review to identify the metrics that have been studied in prior research.
We select influential research papers in the areas of patch submission and acceptance~\cite{jiang2013will,hassan2009predicting,Baysal2015Inv,Pham2013Creating,weissgerber2008small}
and pull request development~\cite{Tsay2014Influence,Gousios:2014,yu16det,zhou2019fork,Rahman2014An,vasilescu2016sky,Bor17Und,Kononenko2018Studying,vasilescu2015gender},
which have recently been published in leading software engineering journals and conferences,
\eg TSE, ICSE, and FSE.
After analyzing these papers,
we identify the metrics that can be computed at pull request submission time.
The identified metrics are classified into the following three categories.

\vspace{0.5em}
\noindent\textbf{Project-level characteristics.}

\vspace{0.2em}
\textit{Maturity.}
Previous studies used the metric \texttt{proj\_age},
\ie the period from the time the project was first hosted on GitHub to the pull request submission time,
as an indicator of project maturity~\cite{Tsay2014Influence,yu16det,Rahman2014An}.
However,
a project does not necessarily adopt the pull request model from the beginning.
We therefore also use the metric \texttt{prmodel\_age} to indicate
how long a project has used the pull request development model.

\vspace{0.2em}
\textit{\hl{Workload.}}
The discussion of issues and pull requests might take days to months to conclude.
At any given time,
many open issues and pull requests
might be under discussion simultaneously.
Prior studies have characterized project \hl{integrators' workload} using two metrics:
\texttt{open\_tasks}~\cite{yu16det} and \texttt{team\_size}~\cite{Tsay2014Influence,Gousios:2014,yu16det},
which are the number of open issues and open pull requests at the pull request submission time and
the number of active core team members during the last three months, respectively.

\vspace{0.2em}
\textit{Popularity.}
In measuring project popularity,
the metric \texttt{stars},
\ie the total number of stars the project has received prior to the pull request submission,
was commonly used in prior studies~\cite{Bor17Und,Tsay2014Influence}.
In addition,
we also consider three other popularity-related metrics:
\texttt{forks}, \texttt{pullreqs}, and \texttt{contributors},
which are the number of forks, the number of pull requests, and the number of contributors
of the project, respectively.

\vspace{0.2em}
\textit{Hotness.}
This metric (\texttt{hotness}) is the total number of changes made to the files touched by the pull request during the three months before the pull request creation time~\cite{Gousios:2014,yu16det}.

\vspace{0.5em}
\noindent\textbf{Submitter-level characteristics.}

\vspace{0.2em}
\textit{Experience.}
Developers' experience before they submit the pull request has been analyzed in prior studies~\cite{Gousios:2014,jiang2013will}.
This measure can be computed from two perspectives:
project-level experience and community-level experience.
The former measures the number of previous pull requests
that the developer has submitted to the specific project (\texttt{prev\_pullreqs\_proj}) and their acceptance rate (\texttt{prev\_prs\_acc\_proj}).
The latter measures the number of previous pull requests
that the developer has submitted to GitHub (\texttt{prev\_pullreqs}) and their acceptance rate (\texttt{prev\_prs\_acc}).
When calculating the acceptance rate,
the determination of whether a pull request was integrated
through mechanisms other than GitHub's merge button follows the heuristics defined in previous studies~\cite{Gousios:2014,zhou2019fork}.
We also use two metrics, \texttt{first\_pr\_proj} and
\texttt{first\_pr}, to represent whether the pull request is the first one submitted by the developer to the specific project and to GitHub, respectively.

\vspace{0.2em}
\textit{Standing.}
A dichotomous metric, \texttt{core\_team},
which indicates whether the pull request submitter is a core team member of the project,
was commonly used as a signal of the developer's standing within the project~\cite{Tsay2014Influence,yu16det}.
Furthermore,
a continuous metric, \texttt{followers},
\ie the number of GitHub users that are following the pull request submitter,
was used to represent the developer's standing in the community~\cite{Tsay2014Influence,Gousios:2014,yu16det}.

\vspace{0.2em}
\textit{Social connection.}
The metric \texttt{prior\_interaction},
which is the total number of events within the project
(\eg commenting on issues and pull requests)
that the developer has participated in
prior to the pull request submission,
is usually used to measure the social connection between the developer and the project~\cite{Tsay2014Influence,yu16det}.

\vspace{0.5em}
\noindent\textbf{Patch-level characteristics.}

\vspace{0.2em}
\textit{Patch size.}
Prior studies~\cite{Tsay2014Influence,Gousios:2014,vasilescu2016sky} quantified the size of a patch,
\ie the changes contained in the pull request,
at different granularities.
The commonly used metrics are
the number of changed files (\texttt{files\_changed}) and the number of lines of code added and deleted (\texttt{loc}).

\vspace{0.2em}
\textit{Textual length.}
The length of the pull request textual content
was used to represent description complexity~\cite{yu16det}.
This metric (\texttt{text\_len}) is
computed by counting the number of characters in the pull request title and description.

\vspace{0.2em}
\textit{Issue tag.}
This metric (\texttt{issue\_tag}) indicates whether the pull request description contains links to other GitHub issues or pull requests~\cite{Gousios:2014,yu16det},
such as \textit{``fix issue \#1011''}.
We determine this metric
by automatically checking for the presence of cross-references in the pull request description
using regular expressions.
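
A minimal sketch of such a check, assuming GitHub's cross-reference conventions (\texttt{\#123} or \texttt{owner/repo\#123}, optionally preceded by a closing keyword); the exact pattern used in our scripts may differ:

{\footnotesize
\begin{verbatim}
import re

# GitHub-style cross-references, optionally preceded
# by a closing keyword such as "fixes" or "closes".
ISSUE_REF = re.compile(
    r"(?:\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+)?"
    r"(?:[\w.-]+/[\w.-]+)?#\d+",
    re.IGNORECASE)

def has_issue_tag(description):
    """True if the description references an issue/PR."""
    return bool(ISSUE_REF.search(description or ""))

print(has_issue_tag("fix issue #1011"))       # True
print(has_issue_tag("Improve docs wording"))  # False
\end{verbatim}
}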

\vspace{0.2em}
\textit{Type.}
Prior studies~\cite{mockus2000identifying,hassan2009predicting} summarized that developers can make three primary types of changes:
\textit{fault repairing (FR)},
\textit{feature introduction (FI)},
and \textit{general maintenance (GM)}.
The change type (\texttt{change\_type}) of a pull request
is identified by analyzing its description
based on a set of manually verified keywords~\cite{mockus2000identifying}.
Prior studies~\cite{hindle2007release,vasilescu2014variation} also identified the types of developers' activities
on the basis of the types of changed files.
We follow the classification by Hindle \etal~\cite{hindle2007release},
which includes four types:
\textit{Code} (changing source code files),
\textit{Test} (changing test files),
\textit{Build} (changing build files),
and \textit{Doc} (changing documentation files).
This metric (\texttt{activity\_type}) is determined by checking the names and extensions of the files changed by the pull request.
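
The following sketch illustrates both determinations; the keyword and extension lists are illustrative placeholders rather than the manually verified lists used in the study, and the simplified logic labels a pull request by its first matching file:

{\footnotesize
\begin{verbatim}
from pathlib import PurePosixPath

# Illustrative keyword lists (the study relies on
# manually verified keywords).
CHANGE_KEYWORDS = {
    "FR": ["fix", "bug", "error", "fail", "crash"],
    "FI": ["add", "feature", "implement", "support"],
    "GM": ["refactor", "cleanup", "rename", "typo"],
}

# Illustrative extension sets for activity types.
DOC_EXT = {".md", ".rst", ".txt"}
BUILD_EXT = {".gradle", ".cmake", ".yml", ".yaml"}

def change_type(description):
    text = description.lower()
    for ctype, kws in CHANGE_KEYWORDS.items():
        if any(kw in text for kw in kws):
            return ctype
    return "Other"

def activity_type(changed_files):
    for path in changed_files:
        p = PurePosixPath(path.lower())
        if "test" in p.parts or "tests" in p.parts:
            return "Test"
        if p.suffix in DOC_EXT:
            return "Doc"
        if p.suffix in BUILD_EXT:
            return "Build"
    return "Code"

print(change_type("Fix crash when copying objects"))  # FR
print(activity_type(["docs/guide.md"]))               # Doc
\end{verbatim}
}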

\subsubsection{Comparative exploration}
\label{ss:compare_explore}
In order to explore the differences between duplicate and non-duplicate pull requests,
we compare them in terms of each of the collected metrics and
study to what extent a metric varies across duplicate and non-duplicate pull requests.
Specifically,
we use the following hypotheses:

\vspace{0.2em}
$H_{0}$: duplicate and non-duplicate pull requests exhibit the same value of metric $m$.

\vspace{0.2em}
$H_{1}$: duplicate
and non-duplicate pull requests exhibit different values of metric $m$.

\vspace{0.2em}
$\forall m \in$
\{
% project
\texttt{proj\_age}, \texttt{prmodel\_age},
\texttt{open\_tasks}, \texttt{team\_size},
\texttt{forks}, \texttt{stars}, \texttt{contributors}, \texttt{pullreqs}, \texttt{hotness},
% submitter
\texttt{prev\_pullreqs}, \texttt{prev\_pullreqs\_proj}, \texttt{first\_pr},
\texttt{prev\_prs\_acc\_proj}, \texttt{first\_pr\_proj},
\texttt{prev\_prs\_acc},
\texttt{core\_team}, \texttt{followers}, \texttt{prior\_interaction},
% patch
\texttt{loc}, \texttt{files\_changed},
\texttt{text\_len},
\texttt{issue\_tag}, \texttt{change\_type}, \texttt{activity\_type}\}

\vspace{0.2em}
$H_{0}$ is tested with the \textit{Mann-Whitney-Wilcoxon} test~\cite{Wilcoxon1945Individual} on continuous metrics and the \textit{Chi-square} test~\cite{ramsey2012statistical} on categorical metrics.
The test results are listed in Table~\ref{tab:hypo_test},
which reports the p-value and effect size of each test.
To measure the effect size,
we use \textit{Cliff's delta (d)}~\cite{Long2003Ordinal} as it is a non-parametric approach
that does not require the normality assumption.
The p-values are adjusted using the \textit{Benjamini-Hochberg (BH)} method~\cite{Benjamini1995Controlling} to control the false discovery rate.
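
A minimal sketch of the testing procedure for one continuous and one categorical metric, assuming two pandas Series (duplicates vs.\ non-duplicates) per metric; the helper names are illustrative, and Cliff's delta is derived here from the Mann-Whitney U statistic:

{\footnotesize
\begin{verbatim}
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu, chi2_contingency
from statsmodels.stats.multitest import multipletests

def test_continuous(dup, non_dup):
    """Mann-Whitney-Wilcoxon test plus Cliff's delta."""
    u, p = mannwhitneyu(dup, non_dup,
                        alternative="two-sided")
    # Cliff's delta from the U statistic:
    # d = 2U / (n1 * n2) - 1
    d = 2.0 * u / (len(dup) * len(non_dup)) - 1.0
    return p, abs(d)

def test_categorical(dup, non_dup):
    """Chi-square test of independence."""
    is_dup = np.r_[np.ones(len(dup)),
                   np.zeros(len(non_dup))]
    values = pd.concat([dup, non_dup],
                       ignore_index=True)
    _, p, _, _ = chi2_contingency(
        pd.crosstab(is_dup, values))
    return p

# Benjamini-Hochberg adjustment over all p-values.
raw_pvalues = [0.001, 0.04, 0.20]  # placeholders
_, adjusted, _, _ = multipletests(raw_pvalues,
                                  method="fdr_bh")
print(adjusted)
\end{verbatim}
}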

We reject $H_{0}$ and accept $H_{1}$ when the $p$-value is less than 0.05.
We can see that
the null hypothesis is rejected for all metrics
except \texttt{open\_tasks}.
This means that
duplicate and non-duplicate pull requests
are significantly different in terms of all metrics except \texttt{open\_tasks}.
Following previous guidelines~\cite{Thongtanunam2017Revisiting,Ponzanelli2017Supporting} on interpreting the effect size
(trivial: $|d|\le 0.147$; small: $0.147<|d|<0.33$; medium: $0.33 \le |d|<0.474$; large: $|d| \ge 0.474$),
we find that the effect sizes of the differences are generally small, with a maximum of 0.285.

\begin{table}[h]
\color{hltext}{
\centering
\caption{\color{hltext}{The Hypothesis testing results}}
\begin{tabularx}{0.49\textwidth}{@{}l r c l@{}}
\toprule
\multicolumn{2}{r}{\textbf{Metric}} &\tabincell{c}{\textbf{Adjusted}\\\textbf{p-value}} &\tabincell{c}{\textbf{Effect}\\\textbf{size}} \\
\midrule\multicolumn{4}{@{}l}{\textbf{Project-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]\multirow{2}{*}{Maturity}
& \texttt{proj\_age}&4.4e-21 ***& 0.091 \\
& \texttt{prmodel\_age}&2.8e-11 ***& 0.064 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{2}{*}{Workload}
& \texttt{open\_tasks}&0.791 & 0.003 \\
& \texttt{team\_size}&4e-41 ***& 0.131 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{4}{*}{Popularity}
& \texttt{stars}&2.1e-46 ***& 0.139 \\
& \texttt{forks}&9.2e-61 ***& 0.159 \\
& \texttt{contributors}&8.9e-49 ***& 0.142 \\
& \texttt{pullreqs}&7.1e-19 ***& 0.086 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{1}{*}{Hotness}
& \texttt{hotness}&4.7e-36 ***& 0.122 \\
\midrule\multicolumn{4}{@{}l}{\textbf{Submitter-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]\multirow{6}{*}{Experience}
& \texttt{first\_pr}&1.1e-32 ***& 0.045 \\
& \texttt{prev\_pullreqs}&8.7e-94 ***& 0.199 \\
& \texttt{prev\_prs\_acc}&9.7e-90 ***& 0.205 \\
& \texttt{first\_pr\_proj}&7.3e-148 ***& 0.148 \\
\multicolumn{2}{@{}r}{\texttt{prev\_pullreqs\_proj}} &1.3e-190 ***& 0.285\\
& \texttt{prev\_prs\_acc\_proj}&9.7e-52 ***& 0.173 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{2}{*}{Standing}
& \texttt{core\_team}&2.8e-116 ***& 0.192 \\
& \texttt{followers}&1.2e-20 ***& 0.090 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{1}{*}{Connection}
& \texttt{prior\_interaction}&5.9e-107 ***& 0.212 \\
\midrule\multicolumn{4}{@{}l}{\textbf{Patch-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]\multirow{2}{*}{Size}
& \texttt{files\_changed}&4.5e-08 ***& 0.050 \\
& \texttt{loc}&3e-16 ***& 0.079 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{1}{*}{Length}
& \texttt{text\_len}&1e-65 ***& 0.166 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{1}{*}{Reference}
& \texttt{issue\_tag}&0.001 **& 0.027 \\
\cdashline{1-4}[0.8pt/2pt]\multirow{2}{*}{Type}
& \texttt{change\_type}&9.1e-26 ***& 0.095 \\
& \texttt{activity\_type}&3.4e-07 ***& 0.003 \\
\bottomrule
\multicolumn{4}{l}{\emph{*** p \textless 0.001, ** p \textless 0.01, * p \textless 0.05}}
\end{tabularx}
\label{tab:hypo_test}
}
\end{table}

\subsubsection{Regression analysis}
\label{ss:dup_regression}

The comparative exploration does not consider the correlations between metrics.
As a refinement,
we apply a regression analysis to model the effect of the selected metrics on pull requests' likelihood of being duplicates.

\vspace{0.5em}
\noindent\textbf{Regression modeling.}
We build a mixed-effects logistic regression model
to capture the relationship between the explanatory variables, \ie the metrics discussed in Section~\ref{ss:metrics}, and a response variable, \ie \textit{is\_dup}, which indicates whether a pull request is a duplicate.
Instead of building one model with all metrics at once,
we add one level of metrics at a time and build a model at each step,
which allows us to check whether the addition of the new metrics significantly improves the model.
As a result,
we compare the fit of three models:
a)~\textit{Model 1}, which includes only the project-level variables,
b)~\textit{Model 2}, which adds the submitter-level variables,
and c)~\textit{Model 3}, which further adds the patch-level variables.
To account for \hl{project-level effects} unrelated to the explanatory variables,
the selected metrics are modeled as fixed effects, and
a new variable, \texttt{proj\_id}, is modeled as a random effect.
In the model,
all numeric factors are log transformed (plus 0.5 if necessary)
to stabilize variance and reduce heteroscedasticity~\cite{metz1978basic}.
We manually check the distributions of all variables and conservatively remove no more than 3\% of the values as outliers from variables with exponential distributions.
This slightly reduces the size of the dataset on which we build the
regression models, but ensures that our models are robust against outliers~\cite{osborne2004power}.
In addition,
we check the Spearman correlations and the Variance Inflation Factors (VIF below 5, as recommended~\cite{cohen2014applied}) among the predictors to mitigate multicollinearity.
This process leaves us with 17 features,
which can be
seen in Table~\ref{tab:diff_models}.
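
A simplified sketch of the multicollinearity check and model fitting is given below; it assumes a pandas DataFrame \texttt{df} with the (already log-transformed) predictors and the binary response \texttt{is\_dup}, and, for brevity, fits a plain fixed-effects logistic regression, whereas the models reported here additionally include a random intercept for \texttt{proj\_id}.

{\footnotesize
\begin{verbatim}
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import (
    variance_inflation_factor)
from sklearn.metrics import roc_auc_score

def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the predictor with the highest
    VIF until all VIFs are below the threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i)
             for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() < threshold:
            return X
        X = X.drop(columns=[vifs.idxmax()])

# df: one row per pull request, predictors already
# log(x + 0.5)-transformed where necessary.
# X = drop_high_vif(df.drop(columns=["is_dup",
#                                    "proj_id"]))
# m = sm.Logit(df["is_dup"],
#              sm.add_constant(X)).fit()
# print(m.summary())
# pred = m.predict(sm.add_constant(X))
# print("AUC:", roc_auc_score(df["is_dup"], pred))
\end{verbatim}
}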

\begin{table*}[h]
\color{hltext}{
\renewcommand{\arraystretch}{1.15}
\centering
\caption{\color{hltext}{Statistical models for the likelihood of duplicate pull requests}}
\begin{tabularx}{\textwidth}{@{}r Y Y Y Y Y Y Y Y Y@{}}
\toprule
& \multicolumn{3}{c}{\textbf{Model 1}}& \multicolumn{3}{c}{\textbf{Model 2}} & \multicolumn{3}{c}{\textbf{Model 3}}\\
& \multicolumn{3}{c}{response: \textit{is\_dup} = 1}& \multicolumn{3}{c}{response: \textit{is\_dup} = 1} & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}\\
\cmidrule(r){2-4} \cmidrule(r){5-7} \cmidrule(r){8-10}
& Coeffs. & Errors & Signif. & Coeffs. & Errors & Signif.& Coeffs. & Errors & Signif.\\
\midrule
\texttt{log(proj\_age)} & 0.005 & 0.064 & & 0.127 & 0.065 & * & 0.068 & 0.061 & \\
\texttt{log(open\_tasks + 0.5)} & 0.247 & 0.043 & *** & 0.199 & 0.042 & *** & 0.192 & 0.042 & *** \\
\texttt{log(team\_size + 0.5)} & 0.152 & 0.080 & . & 0.209 & 0.079 & ** & 0.256 & 0.080 & ** \\
\texttt{log(watchers + 0.5)} & -0.046 & 0.013 & *** & -0.044 & 0.013 & *** & -0.049 & 0.013 & *** \\
\texttt{log(hotness + 0.5)} & -0.049 & 0.013 & *** & -0.010 & 0.013 & & -0.049 & 0.015 & *** \\
\midrule
\texttt{log(prev\_pullreqs + 0.5)} & - & - & - & -0.034 & 0.014 & * & -0.020 & 0.014 & \\
\texttt{log(prev\_prs\_acc + 0.5)} & - & - & - & -0.207 & 0.064 & ** & -0.221 & 0.064 & *** \\
\texttt{first\_pr\_proj TRUE} & - & - & - & 0.254 & 0.055 & *** & 0.231 & 0.055 & *** \\
\texttt{log(followers + 0.5)} & - & - & - & 0.005 & 0.013 & & 0.002 & 0.013 & \\
\texttt{core\_team TRUE} & - & - & - & -0.222 & 0.056 & *** & -0.207 & 0.056 & *** \\
\texttt{log(prior\_interaction + 0.5)} & - & - & - & -0.035 & 0.011 & ** & -0.044 & 0.011 & *** \\
\midrule
\texttt{log(files\_changed + 0.5)} & - & - & - & - & - & - & 0.098 & 0.030 & ** \\
\texttt{log(loc + 0.5)} & - & - & - & - & - & - & -0.049 & 0.015 & *** \\
\texttt{log(text\_len + 0.5)} & - & - & - & - & - & - & 0.120 & 0.017 & *** \\
\texttt{issue\_tag TRUE} & - & - & - & - & - & - & 0.102 & 0.038 & ** \\
\texttt{change\_type FI} & - & - & - & - & - & - & -0.308 & 0.043 & *** \\
\texttt{change\_type GM} & - & - & - & - & - & - & -0.368 & 0.060 & *** \\
\texttt{change\_type Other} & - & - & - & - & - & - & -0.291 & 0.048 & *** \\
\texttt{activity\_type Test} & - & - & - & - & - & - & -0.285 & 0.072 & *** \\
\texttt{activity\_type Build} & - & - & - & - & - & - & 0.020 & 0.084 & \\
\texttt{activity\_type Doc} & - & - & - & - & - & - & -0.317 & 0.059 & *** \\
\texttt{activity\_type Other} & - & - & - & - & - & - & -0.006 & 0.060 & \\
\midrule
\multicolumn{1}{r}{Area Under the ROC Curve:} &\multicolumn{3}{c}{0.700}& \multicolumn{3}{c}{0.719} & \multicolumn{3}{c}{0.729}\\
\bottomrule
\multicolumn{10}{l}{\emph{*** p \textless 0.001, ** p \textless 0.01, * p \textless 0.05}}
\end{tabularx}
\label{tab:diff_models}
}
\end{table*}

\vspace{0.5em}
\noindent\textbf{Analysis results.}
The analysis results are shown in Table~\ref{tab:diff_models}.
In addition to the coefficient, standard error, and significance level
of each variable,
the table reports the AUC (Area Under the ROC Curve) of each model.
Overall,
Model 3 performs better than the other two models (AUC: 0.729 vs. 0.700/0.719), and the three models show consistent variable effects
(\ie no significant effect flips from positive to negative or vice versa);
therefore, we discuss the effects based on Model 3.

With regard to project-level predictors,
\texttt{open\_tasks} and \texttt{team\_size} have significant, positive effects.
This suggests that
the more open tasks (pull requests and issues) and active core team members there are at the pull request submission time,
the more likely a pull request is to be a duplicate.
We assume that the more open tasks there are, the harder it is for developers to check for similar work,
and the more active core members there are, the more likely external contributors are to collide with them.
Perhaps surprisingly,
the predictor \texttt{watchers} has a strong, negative effect,
which means that the more popular the project becomes,
the less likely a submitted pull request is to be a duplicate.
\hl{One possible explanation is that
as project popularity increases,
the numbers of both duplicate and non-duplicate pull requests tend to increase,
but non-duplicate pull requests increase at a higher rate than duplicate ones.}
We also see a negative effect of the predictor \texttt{hotness}.
This indicates that
changing cold files increases the possibility of pull requests being duplicates.
\hl{We assume that pull requests changing hot files tend to be reviewed faster and accepted in a timely fashion.}
A quick review allows the target issue to be solved in a short time,
which prevents others from encountering the same issue and submitting duplicate pull requests.

As for submitter-level predictors,
\texttt{prev\_prs\_acc} has a negative effect
and \texttt{first\_pr\_proj} has a positive effect when its value is \texttt{TRUE}.
This suggests that
pull requests submitted by inexperienced developers and newcomers
are more likely to be duplicates.
One possible explanation may be that
inexperienced new developers lack sufficient knowledge of the collaborative development process and practices~\cite{steinmacher2015social},
and thus cannot effectively avoid duplicated work.
We notice that
the predictors \texttt{prior\_interaction}
and \texttt{core\_team} (\texttt{TRUE}) have significant, negative effects.
This indicates that
pull requests from core team members and from developers
who have a stronger social connection to the project
are less likely to be duplicates.
This might be due to their well-maintained awareness of the project status,
which comes from their active participation in the project~\cite{Gutwin2004Group}.

For patch-level predictors,
the two size-related predictors present opposite effects.
The predictor \texttt{loc} has a negative effect,
which indicates that
pull requests changing more lines of code have a lower chance of being duplicates,
while the predictor \texttt{files\_changed} has a positive effect,
suggesting that
pull requests changing more files are more likely to be duplicates.
\hl{The cause of these opposite effects is currently unclear};
we think this interesting result deserves future investigation.
The predictor \texttt{text\_len} has a positive effect,
indicating that pull requests with complex descriptions
are more likely to be duplicates.
A longer description may indicate higher complexity and thus a longer evaluation~\cite{yu16det},
which increases the likelihood of the same issue being encountered by more developers who might also submit a patch for the issue.
The predictor \texttt{issue\_tag} has a positive effect when its value is \texttt{TRUE},
suggesting that pull requests solving already tracked issues have a greater chance of being duplicates.
One possible reason is that
tracked issues are already publicly visible, and
they are more likely to attract multiple interested developers and result in conflicting work.
In terms of change types (\texttt{change\_type}),
we can see that,
compared with pull requests of the type \textit{FR},
pull requests of the types \textit{FI}, \textit{GM}, and \textit{Other} are less likely to be duplicates.
We speculate that bug fixes are more likely to produce duplicates
because bugs tend to have a general effect on a larger developer base,
whereas new features or maintenance requirements
might be specific to a certain group of developers.
For activity types (\texttt{activity\_type}),
we notice that
pull requests changing test files (\texttt{activity\_type Test}) and documentation files (\texttt{activity\_type Doc})
have a lower chance of being duplicates
compared with those changing source code files.
OSS projects usually encourage newcomers to try their first contribution by writing documentation and test cases~\cite{TD-guide, TD-label}.
We conjecture that activities changing source code files might require more effort and time for the local work,
which makes them more risky and prone to duplication.
}%for red color

\begin{mdframed}[backgroundcolor=gray!8]
\noindent
\textit{
\textbf{RQ2-2:}
Duplicate pull requests are significantly different
from non-duplicate pull requests in terms of
project-level characteristics
(e.g., changing cold files and being submitted when the project has more active core team members),
submitter-level characteristics
(e.g., being submitted by developers who have less contribution experience and weaker connections to the project),
and patch-level characteristics
(e.g., solving already tracked issues rather than untracked issues and fixing bugs rather than adding new features or refactoring).
}
\end{mdframed}