
%!TEX root = main.tex

\section{Data collection}
% method of the dataset

\subsection{Studied Projects}
% Introduce the 20-plus studied projects
We study 26 open-source projects hosted on GitHub,
involving 12 programming languages and various application domains
(\eg web-application frameworks, databases, and scientific computing libraries).
Table~\ref{tab:basic_dataset} presents summary statistics on
project scale and popularity, \eg the number of PRs, contributors, and forks,
which show that the projects have attracted considerable attention from the community.
Moreover, all studied projects make mature and heavy use of the PR mechanism
(the minimum number of PRs is 5,050).
More details can be found in the released dataset.

\begin{table}[!htbp]
	\centering
  \caption{Statistical information on the studied projects}
  \vspace{-0.2cm}
	\begin{tabular}{r c c c c}
		\toprule
		\textbf{Statistic} & \textbf{Min} & \textbf{Max} & \textbf{Mean} & \textbf{Median} \\ \midrule
		\#PR          & 5,050 & 31,600  & 10,912 & 12,753 \\
		\#Contributor & 518   & 3,395   & 1,283  & 1,034  \\
		\#Fork        & 1,759 & 55,075  & 9,317  & 5,131  \\
		\#Star        & 1,277 & 117,220 & 25,290 & 16,917 \\
		\#Watch       & 112   & 7,116   & 1,759  & 1,303  \\
		\bottomrule
		% \multicolumn{5}{l}{\emph{Note: Co}}
	\end{tabular}
  \vspace{-0.2cm}
	\label{tab:basic_dataset}
  \vspace{-0.2cm}
\end{table}


\subsection{Method}
% In Stack Overflow,
% duplicate questions are indicated by the mark ``[duplicate]'' at the end of question titles,
% and this makes it easy to retrieve all the duplicate questions in Stack Overflow
% by checking the existence of duplicate mark.
% However,
% \hl{there is no explicit and unified mechanism to indicate a duplicate PR in GitHub
% except a pre-defined reply template\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}, December 11, 2017}  provided by the platform.}
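% !!!! This should be re-explained according to the actual situation; we need not compare with SO, just say that although GH provides a mechanism, it is neither unified nor mandatory, and people can point out duplicates in any way they like.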
% Reviewers participate in the discussions of PRs and
% they point out a PR is duplicate to another one in review comments
% when they come across duplicate PRs.

Unlike Stack Overflow,
which marks duplicate posts with ``[duplicate]'' at the end of question titles,
GitHub provides no explicit, unified mechanism for indicating duplicate PRs.
Although reviewers are encouraged to use the pre-defined
reply template\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}}
when they want to point out that a PR duplicates another one,
they are free to phrase such comments in many other ways.
%
Therefore, to collect a comprehensive dataset of duplicate PRs in GitHub,
we have to examine the historical review comments carefully.
% 6. At the start of the data collection part, mention that the raw basic information of projects, PRs, etc. is obtained through the official API (it can easily be obtained with a web crawler), and that the main work is to discover the dup relations between PRs on top of this base data, which is what this section mainly covers.
The raw data of projects, pull requests, and comments is readily available
through the official API provided by GitHub,
so the key challenge is how to identify duplicate PRs from that raw data.
In this paper, we design a mixed approach that combines automatic identification and manual examination,
as illustrated in Figure~\ref{fig:data_get},
where rounded rectangles stand for the actions we take
and parallelograms represent the input or output data of those actions.
The details of our collection process are discussed as follows.
%The details of our novel collecting approach are discussed as follows.


\subsubsection{Random sampling}
For each project,
we randomly sample 200 review comments
that contain at least one reference
(\ie the number or the URL of a PR) to another PR.
A cross-PR reference in a review comment is evidence that
some kind of relation exists between two PRs.
It is also a necessary condition for
finding duplicate PRs,
because reviewers have to reference another PR
when they point out a duplicate relation between PRs.
Using cross-PR references as a filter criterion in sampling
reduces the proportion of noise in the sampled comments
to be processed in the following action (\ie manual examination)
and therefore improves examination efficiency.
% In total, we sampled ** comments expected to be manually identified.
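
The sampling step can be sketched in Python as follows.
This is a minimal illustration rather than the script used to build the dataset:
the function name \texttt{sample\_referencing\_comments},
the reference pattern, and the representation of comments as plain strings
are all assumptions made for the example.

```python
import random
import re

# Assumed reference formats: "#123" or a full pull-request URL.
PR_REFERENCE = re.compile(r"#\d+|github\.com/[\w.-]+/[\w.-]+/pull/\d+")


def sample_referencing_comments(comments, k=200, seed=None):
    """Return up to k randomly chosen comments that reference a PR."""
    # Keep only comments that contain at least one PR-style reference.
    candidates = [c for c in comments if PR_REFERENCE.search(c)]
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(candidates, min(k, len(candidates)))
```

A pattern such as \texttt{\#123} cannot tell a PR from an issue,
so a sketch like this only narrows the pool of comments;
it does not replace the later checks.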

\subsubsection{Manual examination}
For each sampled comment,
we manually examine whether a reviewer uses it
to point out a duplicate relation between PRs.
We call such comments \textit{indicative comments};
they help us reconstruct the duplicate relations.
Note that
a considerable number of the sampled comments are not indicative comments,
because cross-PR references can also indicate
conflict, dependency, or association relations among PRs.
% Totally, we identified ** indicative comments.


\subsubsection{Rules extraction}
We review all the manually identified indicative comments and
extract rules that can later be applied to
automatically judge whether a given comment is an indicative comment.
In practice,
certain phrases occur frequently when reviewers state a duplicate relation between PRs.
Similarly,
we call such phrases \textit{indicative phrases}.
The following are several example comments containing indicative phrases.

\begin{itemize}
	\item \textit{``dup of \#31372''}
	\item \textit{``Closed by https://github.com/rails/rails/pull/13867''}
	\item \textit{``This has been addressed in \#27768.''}
\end{itemize}


In the above example comments,
\textit{``dup of''}, \textit{``closed by''}, and \textit{``addressed in''}
are typical indicative phrases.
Together with PR references,
these indicative phrases can be used to compose identification rules.
An identification rule can be implemented as a regular expression
that is matched against comment text to identify duplicate relations.
The following items are some simplified rules;
the complete set of our rules can be found
online.\footnote{\url{https://github.com/whystar/MSR2018-DupPR/blob/master/code/rules.py}}


\begin{itemize}
	\item \texttt{\footnotesize
			closed by (?:$\backslash$w+:? )\{,5\} (?:\#($\backslash$d+))
			}

	\item \texttt{\footnotesize
			(?:\#($\backslash$d+)):? (?:$\backslash$w+:? )\{,5\} dup(?:licate)?
			}

\end{itemize}
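
To illustrate how such rules might be applied,
the following Python sketch compiles the two simplified rules above
and extracts the referenced PR number from a comment.
It is not the released \texttt{rules.py}:
the function name \texttt{referenced\_duplicates},
the case-insensitive matching,
and the assumption that the spaces around the optional word group
in the printed rules are typographic are all ours.

```python
import re

# The two simplified rules, with {0,5} written out explicitly.
# Each rule captures the referenced PR number in group 1.
RULES = [
    re.compile(r"closed by (?:\w+:? ){0,5}#(\d+)", re.IGNORECASE),
    re.compile(r"#(\d+):? (?:\w+:? ){0,5}dup(?:licate)?", re.IGNORECASE),
]


def referenced_duplicates(comment):
    """Return the PR numbers that a comment marks as duplicates, or []."""
    for rule in RULES:
        match = rule.search(comment)
        if match:
            return [int(match.group(1))]
    return []  # no rule matched: not an indicative comment
```

These two rules only cover references of the form \texttt{\#[number]};
a comment that references a PR by its URL is not matched by this sketch
and would need additional rules.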


\subsubsection{Automatic identification}
Using the extracted identification rules,
we automatically identify indicative comments
and then discover duplicate PRs.
If a review comment is identified as an indicative comment,
the PR references it contains are extracted.
Each extracted PR, together with the PR that the indicative comment belongs to,
forms a pair of candidate duplicates.
We also impose some preliminary constraints on candidate duplicate PRs.
For example,
the two PRs in a candidate pair cannot be submitted by the same contributor.
In that case the author is obviously aware of both PRs,
which means the duplication is intentional and
the author submitted the duplicate PRs for some purpose.
Such intentional duplicates are not included in our dataset.
Moreover,
PRs and issues share the same numbering system in GitHub,
and issues can be referenced in the same format as PRs, \ie ``\#[number]''.
Therefore,
we have to verify that each extracted ``PR'' is really a PR rather than an issue.
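
The constraints described above can be sketched as a small filter.
The \texttt{Item} record, the \texttt{candidate\_pairs} function,
and the in-memory \texttt{items} lookup are hypothetical names introduced for illustration;
in practice the author and the PR-or-issue flag would come from the collected raw data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    number: int
    author: str
    is_pull_request: bool  # False for issues, which share the numbering


def candidate_pairs(source_number, referenced_numbers, items):
    """Pair the commented PR with each referenced PR that passes the checks."""
    source = items[source_number]
    pairs = []
    for number in referenced_numbers:
        target = items.get(number)
        if target is None or number == source_number:
            continue  # unknown number or self-reference
        if not target.is_pull_request:
            continue  # "#N" pointed at an issue, not a PR
        if target.author == source.author:
            continue  # same contributor: intentional duplicate, excluded
        pairs.append((source_number, number))
    return pairs
```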

\subsubsection{Manual verification}


Automatic identification inevitably introduces false positives,
that is, some identified candidate duplicate PRs are not actually duplicates.
To further clean the automatically identified dataset,
we manually examine and verify all the candidate duplicate PRs.
For a pair of candidate duplicates,
the earlier-submitted one is called the \textit{master PR}
and the later-submitted one is called the \textit{duplicate PR} in this paper.
We then review each pair of candidate duplicates and
label it as ``really duplicate'' if it meets the following criteria:
(a)~The author of the duplicate PR is not aware of the existence of the master PR.
This is the default assumption
unless we find clear contrary evidence in the discussion history of both PRs.
(b)~Reviewers have reached a consensus on the duplication.
When a reviewer points out that a PR duplicates another one,
this claim must be acknowledged and affirmed.
One of the most common responses is the immediate closing of one of the duplicate PRs.
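
The master/duplicate naming convention amounts to ordering each pair by submission time.
A minimal sketch, assuming each PR is represented as a dictionary
with a hypothetical \texttt{created\_at} timestamp field:

```python
from datetime import datetime


def order_pair(pr_a, pr_b):
    """Return (master, duplicate): the earlier-submitted PR comes first."""
    master, duplicate = sorted((pr_a, pr_b), key=lambda pr: pr["created_at"])
    return master, duplicate
```

The two criteria themselves rely on reading the discussion history
and are therefore not automated in this sketch.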

After manual verification is finished,
the final dataset of 2,323 pairs of duplicate PRs is constructed.