%!TEX root = main.tex

\section{Data collection}
% method of the dataset

\subsection{Studied Projects}
% Introduce the 20-odd studied projects
We study 26 open source projects hosted on GitHub,
covering 12 programming languages and various application domains
(\eg web-application frameworks, databases, and scientific computing libraries).
Table~\ref{tab:basic_dataset} presents summary statistics on
project scale and popularity, \eg the number of PRs, contributors, and forks,
which show that the projects have attracted considerable attention from the community.
The studied projects also make mature and heavy use of the PR mechanism
(the minimum number of PRs is 5,050).
More details can be found in the released dataset.

\begin{table}[!htbp]
\centering
\caption{Summary statistics of the studied projects}
\vspace{-0.2cm}
\begin{tabular}{r c c c c}
\toprule
\textbf{Statistic} &\textbf{Min} &\textbf{Max} &\textbf{Mean} &\textbf{Median}\\ \midrule
\#PR & 5,050 & 31,600 & 10,912 & 12,753 \\
\#Contributor & 518 & 3,395 & 1,283 & 1,034 \\
\#Fork & 1,759 & 55,075 & 9,317 & 5,131 \\
\#Star & 1,277 & 117,220 & 25,290 & 16,917 \\
\#Watch & 112 & 7,116 & 1,759 & 1,303 \\
\bottomrule
% \multicolumn{5}{l}{\emph{Note: Co}}
\end{tabular}
\vspace{-0.2cm}
\label{tab:basic_dataset}
\vspace{-0.2cm}
\end{table}

\subsection{Method}
% In Stack Overflow,
% duplicate questions are indicated by the mark ``[duplicate]'' at the end of question titles,
% and this makes it easy to retrieve all the duplicate questions in Stack Overflow
% by checking the existence of duplicate mark.
% However,
% \hl{there is no explicit and unified mechanism to indicate a duplicate PR in GitHub
% except a pre-defined reply template\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}, December 11, 2017} provided by the platform.}
% !!!! This should be re-explained according to the actual situation. We need not compare with SO; just say that although GH provides a mechanism, it is neither unified nor mandatory, and people can point out duplicates in any way they like.
% Reviewers participate in the discussions of PRs and
% they point out a PR is duplicate to another one in review comments
% when they come across duplicate PRs.

Unlike Stack Overflow,
which marks duplicate posts with a ``[duplicate]'' tag at the end of question titles,
GitHub provides no explicit and unified mechanism to indicate duplicate PRs.
Although reviewers are encouraged to use the pre-defined
reply template\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}}
when they want to point out that a PR duplicates another one,
many other forms of comment are also used.
%
Therefore, to collect a comprehensive dataset of duplicate PRs on GitHub,
we have to examine the historical review comments carefully.
% 6. At the start of data collection, mention that the raw basic information on projects, PRs, etc. is obtained through the official API (it can easily be obtained with a web crawler), while the main work is to discover the dup relations between PRs on top of this basic data, which is what this section mainly describes.
The raw data on projects, pull requests, and comments are readily available
through the official API provided by GitHub,
so the key question is how to identify duplicate PRs from those raw data.
In this paper, we design a mixed approach that combines automatic identification and manual examination,
as illustrated in Figure~\ref{fig:data_get},
where rounded rectangles stand for the actions we have taken
and parallelograms represent the input or output data of the actions.
The details of our novel collection process are discussed below.
%The details of our novel collecting approach are discussed as follows.

\subsubsection{Random sampling}
For each project,
we randomly sample 200 review comments
that contain at least one reference
(\ie the number or the URL of a PR) to another PR.
Cross-PR references in review comments are evidence that
some kind of relation exists between two PRs.
In fact, such a reference is also a necessary condition for
finding duplicate PRs,
because reviewers have to reference other PRs
when they want to point out a duplicate relation among PRs.
Using cross-PR references as a filter criterion in sampling
reduces the proportion of noise in the sampled comments
to be processed in the following action (\ie manual examination)
and therefore improves the examination efficiency.
% In total, we sampled ** comments expected to be manually identified.

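The sampling step can be sketched as follows.
This is only an illustrative sketch, not the script used to build the dataset:
the pattern \texttt{PR\_REF}, the function \texttt{sample\_comments}, and the fixed seed are assumptions introduced here,
and the pattern does not check that a reference points to \emph{another} PR rather than to an issue or to the same PR.

```python
import random
import re

# Hypothetical pattern for a cross-PR reference: "#123" or a pull-request URL.
# It is an assumption of this sketch, not a rule taken from the released code.
PR_REF = re.compile(r"#\d+|https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")


def sample_comments(comments, k=200, seed=0):
    """Randomly sample up to k comments that contain a PR-like reference."""
    # Keep only comments with at least one reference; the rest are noise.
    candidates = [c for c in comments if PR_REF.search(c)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(k, len(candidates)))
```

Filtering before sampling means that every comment handed to the manual examination already carries a reference.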
\subsubsection{Manual examination}
For each sampled comment,
we manually examine whether a reviewer uses it
to point out a duplicate relation among PRs.
We call such comments \textit{indicative comments};
they help us reconstruct the duplicate relations.
Note that
quite a number of sampled comments are not indicative comments,
because cross-PR references can also indicate
a relation of conflict, dependency, or association among PRs.
% Totally, we identified ** indicative comments.

\subsubsection{Rules extraction}
We review all the manually identified indicative comments and
try to extract rules that can be applied later to
automatically judge whether a given comment is an indicative comment.
In practice,
some phrases occur frequently when reviewers state a duplicate relation between PRs.
Similarly,
we call such phrases \textit{indicative phrases}.
The following are several example comments containing indicative phrases.

\begin{itemize}
\item \textit{``dup of \#31372''}
\item \textit{``Closed by https://github.com/rails/rails/pull/13867''}
\item \textit{``This has been addressed in \#27768.''}
\end{itemize}

In the above example comments,
\textit{``dup of''}, \textit{``closed by''}, and \textit{``addressed in''}
are typical indicative phrases.
Together with PR references,
these indicative phrases can be used to compose identification rules.
An identification rule can be implemented as a regular expression
that is matched against the comment text to identify duplicate relations.
The following items are some simplified rules,
and the complete set of our rules can be found
online.\footnote{\url{https://github.com/whystar/MSR2018-DupPR/blob/master/code/rules.py}}

\begin{itemize}
\item \texttt{\footnotesize
closed by (?:$\backslash$w+:? )\{,5\} (?:\#($\backslash$d+))
}

\item \texttt{\footnotesize
(?:\#($\backslash$d+)):? (?:$\backslash$w+:? )\{,5\} dup(?:licate)?
}

\end{itemize}

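To illustrate how such rules can be applied, the two simplified rules can be transcribed into Python's \texttt{re} syntax as sketched below.
This is not the released \texttt{rules.py}:
the helper \texttt{referenced\_prs} is hypothetical,
case-insensitive matching is an assumption,
and the spaces surrounding the \texttt{\{,5\}} group in the typeset rules are assumed to be typographic and are omitted.
The sketch handles only references of the form \texttt{\#number}, not PR URLs.

```python
import re

# Simplified rules transcribed from the two items above (assumptions:
# IGNORECASE, and no literal space around the "{,5}" word group).
RULES = [
    # "closed by", up to five intervening words, then "#<number>"
    re.compile(r"closed by (?:\w+:? ){,5}(?:#(\d+))", re.IGNORECASE),
    # "#<number>", up to five intervening words, then "dup" or "duplicate"
    re.compile(r"(?:#(\d+)):? (?:\w+:? ){,5}dup(?:licate)?", re.IGNORECASE),
]


def referenced_prs(comment):
    """Return the numbers captured by any rule; empty if not indicative."""
    found = []
    for rule in RULES:
        for match in rule.finditer(comment):
            found.append(int(match.group(1)))
    return found
```

Under these assumptions, a comment such as ``Closed by \#13867'' yields the number 13867, whereas a comment that only mentions a dependency on another PR yields an empty list.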
\subsubsection{Automatic identification}
Using the extracted identification rules,
we automatically identify indicative comments
and then discover duplicate PRs.
If a review comment is identified as an indicative comment,
the PR references contained in the comment are extracted.
Each extracted PR and the PR that the indicative comment belongs to
form a pair of candidate duplicates.
We also impose some preliminary constraints on candidate duplicate PRs.
For example,
a pair of candidate duplicate PRs cannot be submitted by the same contributor.
Obviously, the same author is aware of the existence of both PRs,
which means the duplication is intentional and
the author submitted the duplicate PRs for some purpose.
Such intentional duplicates are not taken into account in our dataset.
Moreover,
PRs and issues share the same numbering system on GitHub,
and issues may be referenced in the same format as PRs, \ie ``\#[number]''.
Therefore,
we have to verify that each extracted ``PR'' is really a PR rather than an issue.

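The constraints on candidate duplicates can be sketched as a simple filter.
The record type \texttt{Item}, its fields, and the function \texttt{candidate\_pairs} are hypothetical names introduced for illustration;
the sketch assumes that the author and the PR-or-issue status of every numbered item have already been collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A numbered GitHub item; field names are assumptions of this sketch."""
    number: int
    author: str
    is_pull_request: bool  # False for a plain issue


def candidate_pairs(commented_pr, referenced_numbers, items_by_number):
    """Pair the commented PR with each referenced item that passes the checks."""
    pairs = []
    for number in referenced_numbers:
        other = items_by_number.get(number)
        if other is None or not other.is_pull_request:
            continue  # unknown number, or the reference points to an issue
        if other.number == commented_pr.number:
            continue  # a PR cannot duplicate itself
        if other.author == commented_pr.author:
            continue  # same contributor: treated as an intentional duplicate
        pairs.append((commented_pr.number, other.number))
    return pairs
```

The pairs that survive this filter are still only candidates and remain subject to the manual step that follows.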
\subsubsection{Manual verifying}

Automatic identification inevitably introduces false positives,
that is, some identified candidate duplicate PRs are not really duplicates.
To further clean the automatically identified dataset,
we manually examine and verify all the candidate duplicate PRs.
For a pair of candidate duplicates,
the earlier submitted one is called the \textit{master PR}
and the later submitted one is called the \textit{duplicate PR} in this paper.
We then review each pair of candidate duplicates and
label it as ``really duplicate'' if it meets the following criteria:
(a)~The author of the duplicate PR is not aware of the existence of the master PR.
This is the default assumption
unless we find obvious contrary evidence in the discussion history of both PRs.
(b)~Reviewers have reached a consensus on the duplication.
When a reviewer points out that a PR duplicates another one,
this declaration must be responded to and affirmed.
One of the most common responses is the immediate closing of one of the duplicate PRs.

After manual verification is finished,
the final dataset of 2,323 pairs of duplicate PRs is constructed.