
\section{Dataset}
\label{sec:dsrq}

In this study,
we leverage our previous dataset \textit{DupPR}~\cite{dup2018},
which %describes
contains the duplicate relations among pull requests and the profiles and review comments of pull requests from 26 popular OSS projects hosted on GitHub.
We also extend the dataset by adding complementary data, including code commits, check statuses of DevOps tools and contribution histories of developers.  %%%do you put the dataset somewhere?

% \subsection{Construction of \textit{DupPR}}  %%%since this is in the existing work, there is no need to describe its construction process in such detail, you may describe what a record is in DupPR instead of the construction process.

\subsection{\textit{DupPR} basic dataset}

In our prior work~\cite{dup2018},
we built a unique dataset of more than 2,000 pairs of duplicate pull requests (called \textit{DupPR}~\cite{MSR2018-DupPR}) by analyzing the review comments from 26 popular OSS projects hosted on GitHub.
Each pair of duplicates in \textit{DupPR} is represented as a quadruple \texttt{<proj, pr1, pr2, idtf\_cmt>}.
Item \texttt{proj} indicates the project (\eg \textit{rails/rails})
that the duplicate pull requests belong to.
Items \texttt{pr1} and \texttt{pr2} are the tracking numbers of the two pull requests, respectively.
Item \texttt{idtf\_cmt} is a review comment of either \texttt{pr1} or \texttt{pr2},
which is used by reviewers to state the duplicate relation between \texttt{pr1} and \texttt{pr2}.
{\color{hltext}
The dataset is meant to contain only accidental duplicates,
that is, duplicates whose authors were not aware of the other pull request when creating their own.
To increase the accuracy of the dataset,
we recheck it and filter out the intentional duplicates
that were not found before.
Specifically,
we omit duplicates from the dataset when they fit one of the following criteria:
\textit{i)} The authors' discussion on the associated issue reveals that the duplication was on purpose.
A representative comment indicating intentional duplication is \textit{``I saw your PR and there wasn't any activity or follow up in that from last 18 days''};
\textit{ii)} One author performed some action on the other pull request before submitting her/his own.
In addition to commenting,
actions like assigning reviewers and adding labels are also taken into consideration;
and \textit{iii)} One author immediately mentioned the other pull request after submitting her/his own,
which suggests that the author already knew about that pull request.
}


A pull request might be a duplicate of more than one pull request.
Therefore, in this study,
we organize a group of duplicate pull requests in a
tuple structure $\langle dup_1,\ dup_2,\ dup_3,\ \dots,\ dup_n \rangle$
in which the items are sorted by their submission time.
% {\color{hltext}
% \hl{In order to prevent introducing extra complexity to our subsequent analyses},
% we also exclude the tuples in which not each two pull requests are accidentally duplicate
% (it is possible that a pull request is duplicate of two pull requests  submitted by the same author).
% }
In total,
% 2049
we have {\color{hltext}1,751} tuples of duplicate pull requests
and their distribution over different tuple sizes is shown in Table~\ref{tab:tuple_sta}.
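
The grouping step described above can be illustrated with a minimal sketch.
It assumes that pairs which share a pull request are merged transitively (here with a small union-find), which is one plausible reading of the text rather than a description of our actual scripts.
The function name \texttt{group\_duplicates} and the \texttt{submitted\_at} lookup table are hypothetical.

```python
from collections import defaultdict


def group_duplicates(pairs, submitted_at):
    """Merge duplicate pairs into tuples sorted by submission time.

    pairs: iterable of (proj, pr1, pr2) records (the comment item is omitted).
    submitted_at: dict mapping (proj, pr_number) to a sortable timestamp.
    Both inputs are hypothetical stand-ins for the DupPR records.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Link the two pull requests of every duplicate pair.
    for proj, pr1, pr2 in pairs:
        root_a, root_b = find((proj, pr1)), find((proj, pr2))
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected group.
    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)

    # Order the members of each group by submission time.
    return [
        tuple(sorted(members, key=lambda m: submitted_at[m]))
        for members in groups.values()
    ]
```

With this sketch, the pairs $(1, 2)$ and $(2, 3)$ of the same project end up in one tuple of size three, ordered by when each pull request was submitted.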

\begin{table}[h]
	{\color{hltext}
	\centering
	\caption{{\color{hltext}The distribution of tuples over different sizes}}
	\begin{tabularx}{0.49\textwidth}{@{}l Y Y Y Y Y c @{}}
		\toprule
		\textbf{Size} & 2 & 3 & 4 & 5 & 6 & 7\\	\midrule
		\textbf{Count} & 1,664 & 69 & 10 & 5 & 2& 1\\
		\bottomrule
	\end{tabularx}
	\label{tab:tuple_sta}
	}
\end{table}

\subsection{Collecting complementary data}

\subsubsection{Patch detail}
The GitHub API (\texttt{/repos/:owner/:repo/pulls/:pull\_number/commits}) allows us to retrieve the commits on each pull request.
From the returned results,
we parse the \texttt{sha} of each commit and
request the API (\texttt{/repos/:owner/:repo/commits/:commit\_sha}) to return more detailed information about a commit,
including \texttt{author} and \texttt{author\_date}.
Moreover,
the API (\texttt{/repos/:owner/:repo/pulls/:pull\_number/files})
returns the files changed by a pull request,
from which we can parse the \texttt{filename} and \texttt{changes}
 (lines of code added and deleted) of each changed file.
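
As an illustration of the parsing described above, the following sketch builds the three request paths and extracts the fields of interest from already-decoded JSON responses.
The helper names are hypothetical, and the nesting of the author fields under \texttt{commit.author} is an assumption based on the public GitHub REST API response layout; no network request is made here.

```python
API_ROOT = "https://api.github.com"


def pull_commits_url(owner, repo, pull_number):
    """URL listing the commits on a pull request."""
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls/{pull_number}/commits"


def commit_url(owner, repo, commit_sha):
    """URL returning the details of a single commit."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits/{commit_sha}"


def pull_files_url(owner, repo, pull_number):
    """URL listing the files changed by a pull request."""
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls/{pull_number}/files"


def commit_shas(commits_json):
    """Extract the sha of every commit in a decoded commit list."""
    return [commit["sha"] for commit in commits_json]


def commit_author_info(commit_json):
    """Return (author name, author date) of a decoded commit object.

    Assumes the author data is nested under commit -> author.
    """
    author = commit_json["commit"]["author"]
    return author["name"], author["date"]


def changed_files(files_json):
    """Return (filename, changes) for each changed file.

    'changes' is the number of added plus deleted lines.
    """
    return [(entry["filename"], entry["changes"]) for entry in files_json]
```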

\subsubsection{Check statuses}
\label{cs}
Various DevOps tools are seamlessly integrated and widely used in GitHub;
examples are Travis-CI~\cite{travis} for continuous integration and Code-Climate~\cite{climate} for static analysis.
When a pull request has been submitted or updated,
a set of DevOps tools are automatically launched to check
whether the pull request can be safely merged back to the codebase.
GitHub API (\texttt{/repos/:owner/:repo/commits/:ref/status}) returns
the check statuses for a specific commit.
There are two different levels of statuses in the returned results.
Because multiple DevOps tools can be used to check a commit,
each tool is associated with a check status,
which we call the \textit{context-level check status}.
For each context-level check status,
we can parse the \texttt{state} and \texttt{context} fields.
{\color{hltext}
The \texttt{state} of a check can be designated
\textit{success}, \textit{failure}, \textit{pending}, or \textit{error}.}
\textit{success} means a check has successfully passed,
while \textit{failure} indicates that the check has failed.
If the check is still running and no result is returned,
its \texttt{state} is \textit{pending}.
{\color{hltext}
\textit{error} indicates that a check did not run successfully and produced an error.
Following the guidelines of prior work~\cite{bel16oop,souza2017sentiment},
we treat the state \textit{error} the same as \textit{failure}, since both are opposed to \textit{success}.
}
The \texttt{context} indicates which tool is used in a specific check.
{\color{hltext}
According to the bot taxonomy defined in a prior study~\cite{Wessel2018bot},
the checking tools can be classified into three categories:
CI (report continuous integration test results, \eg \texttt{continuous-integration/travis-ci}),
CLA (ensure license agreement signing, \eg \texttt{cla/google}),
and CR (review source code, \eg \texttt{coverage/coveralls} and \texttt{codeclimate}).
Based on all context-level check statuses of a commit,
the API also returns an overall check status of that commit~\cite{statusrule},
which we call the \textit{commit-level check status}.
The \texttt{state} of a commit-level check can be one of
\textit{success}, \textit{failure} and \textit{pending}.
}
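
The relation between the two levels of check statuses can be sketched as follows.
The aggregation function approximates the rule that GitHub documents for the combined status (the API computes this value itself, so the function is only illustrative), and the category table contains just the example contexts named above.
All function and variable names are hypothetical.

```python
# Example contexts taken from the text; a real mapping would be larger.
CATEGORY_BY_CONTEXT = {
    "continuous-integration/travis-ci": "CI",
    "cla/google": "CLA",
    "coverage/coveralls": "CR",
    "codeclimate": "CR",
}


def normalize_state(state):
    """Treat 'error' the same as 'failure'; keep other states unchanged."""
    return "failure" if state == "error" else state


def tool_category(context):
    """Return CI, CLA or CR for a known context, otherwise None."""
    return CATEGORY_BY_CONTEXT.get(context)


def commit_level_state(context_statuses):
    """Approximate the commit-level state from context-level statuses.

    context_statuses: list of dicts with 'state' and 'context' keys,
    assumed to hold only the latest status of each context.
    """
    states = [normalize_state(status["state"]) for status in context_statuses]
    if "failure" in states:
        return "failure"
    if not states or "pending" in states:
        return "pending"
    return "success"
```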


{\color{hltext}
\subsubsection{Timeline events}

The GitHub API (\texttt{/repos/:owner/:repo/issues/:issue\_number/events})
returns the events triggered by activities (\eg assigning a label and posting a comment) in issues and pull requests.
We request this API for each pull request.
From the returned result,
we can parse who (\texttt{actor}) triggered
which event (\texttt{event}) at what time (\texttt{created\_at}).
For \texttt{close} events,
we can parse
which commit (\texttt{commit\_id}, aka SHA) closed the pull request.
Event data are mainly used for rechecking the dataset
and determining pull request acceptance.
}
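
A minimal sketch of how the returned events could be processed is given below.
It assumes that the response has already been decoded from JSON, that the actor's name is found under \texttt{actor.login}, that a close is reported as an event named \texttt{closed}, and that timestamps are ISO 8601 strings in UTC so that they can be compared as text.
The helper \texttt{acted\_before} illustrates one way the events could support rechecking the dataset; it is a hypothetical helper, not our actual rechecking script.

```python
def parse_events(events_json):
    """Return (actor login, event name, creation time) for every event."""
    parsed = []
    for event in events_json:
        # The actor can be missing, e.g. for deleted accounts.
        actor = (event.get("actor") or {}).get("login")
        parsed.append((actor, event["event"], event["created_at"]))
    return parsed


def closing_commit(events_json):
    """Return the commit id attached to the last close event, if any."""
    closes = [e for e in events_json if e["event"] == "closed"]
    return closes[-1].get("commit_id") if closes else None


def acted_before(events_json, login, submitted_at):
    """Check whether `login` triggered any event before `submitted_at`.

    Relies on ISO 8601 UTC timestamps of identical format, which
    sort correctly under plain string comparison.
    """
    return any(
        actor == login and created_at < submitted_at
        for actor, _, created_at in parse_events(events_json)
    )
```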


\subsubsection{Contribution histories}
Rather than requesting the GitHub API,
we use the GHTorrent dataset~\cite{Gousios:2014},
which makes it easier and more efficient to obtain the entire contribution history
for a specific developer in GitHub.
GHTorrent stores its data in several tables and
we mainly use
\texttt{pull\_requests} (\texttt{PR}), \texttt{issues},
\texttt{pull\_request\_history}  (\texttt{PRH}),
\texttt{pull\_request\_comments} (\texttt{PRC}),
and \texttt{issue\_comments} (\texttt{ISC}).
From table \texttt{PR}, table \texttt{PRH}, and table \texttt{PRC},
we can parse
who (\texttt{PRH.actor\_id}) submitted
which pull request (\texttt{PR.pullreq\_id})
to which project (\texttt{PR.base\_repo\_id})
at what time (\texttt{PRH.created\_at})
and who (\texttt{PRC.user\_id}) commented on that pull request
at what time (\texttt{PRC.created\_at}).
Similarly,
from table \texttt{issues} and  table \texttt{ISC},
we can parse
who (\texttt{issues.reporter\_id}) reported
which issue (\texttt{issues.issue\_id})
to which project (\texttt{issues.repo\_id})
at what time (\texttt{issues.created\_at})
and who (\texttt{ISC.user\_id}) commented on that issue at what time (\texttt{ISC.created\_at}).
Based on this information,
we can reconstruct the whole contribution history of a specific developer.
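
The pull request part of such a history can be sketched with two queries.
The example runs against a tiny in-memory SQLite database rather than the real GHTorrent dump; the join key \texttt{pull\_request\_id} and the \texttt{action = 'opened'} filter are assumptions about the GHTorrent schema, and the toy rows are made up purely for illustration.

```python
import sqlite3

# Reduced, assumed versions of the GHTorrent tables used above.
SCHEMA = """
CREATE TABLE pull_requests (id INTEGER, base_repo_id INTEGER, pullreq_id INTEGER);
CREATE TABLE pull_request_history (
    pull_request_id INTEGER, actor_id INTEGER, action TEXT, created_at TEXT);
CREATE TABLE pull_request_comments (
    pull_request_id INTEGER, user_id INTEGER, created_at TEXT);
"""

# Pull requests submitted by one developer.
SUBMITTED = """
SELECT pr.base_repo_id, pr.pullreq_id, prh.created_at
FROM pull_requests pr
JOIN pull_request_history prh ON prh.pull_request_id = pr.id
WHERE prh.action = 'opened' AND prh.actor_id = ?
ORDER BY prh.created_at
"""

# Pull requests the developer commented on.
COMMENTED = """
SELECT pr.base_repo_id, pr.pullreq_id, prc.created_at
FROM pull_request_comments prc
JOIN pull_requests pr ON prc.pull_request_id = pr.id
WHERE prc.user_id = ?
ORDER BY prc.created_at
"""


def make_toy_db():
    """Build an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO pull_requests VALUES (?, ?, ?)",
                     [(1, 100, 42), (2, 100, 43)])
    conn.executemany("INSERT INTO pull_request_history VALUES (?, ?, ?, ?)",
                     [(1, 7, "opened", "2018-01-01 10:00:00"),
                      (1, 8, "closed", "2018-01-05 10:00:00"),
                      (2, 9, "opened", "2018-01-02 08:00:00")])
    conn.executemany("INSERT INTO pull_request_comments VALUES (?, ?, ?)",
                     [(2, 7, "2018-01-02 09:00:00")])
    return conn


def pr_history(conn, developer_id):
    """Return (submitted, commented) pull request records of a developer."""
    submitted = conn.execute(SUBMITTED, (developer_id,)).fetchall()
    commented = conn.execute(COMMENTED, (developer_id,)).fetchall()
    return submitted, commented
```

The issue part of the history would follow the same pattern with the \texttt{issues} and \texttt{issue\_comments} tables.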


{\color{hltext}
\subsubsection{Popularity and reputation}

GHTorrent also provides tables relating to project popularity and developer reputation.
From table \texttt{watchers},
we can parse who (\texttt{user\_id}) started to star which project (\texttt{repo\_id}) at what time (\texttt{created\_at}).
From table \texttt{projects},
we can parse which project (\texttt{id}) was forked from which project (\texttt{forked\_from}) at what time (\texttt{created\_at}).
From table \texttt{followers},
we can parse who (\texttt{follower\_id}) started to follow whom (\texttt{user\_id})
at what time (\texttt{created\_at}).

}
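
Because every record in these tables carries a creation time, popularity and reputation can be measured as of a given moment.
The sketch below counts the stars, forks and followers that existed before a timestamp; measuring them at a specific point in time is an assumption made for illustration, and the rows are represented as plain dictionaries keyed by the column names mentioned above.

```python
def count_before(rows, column, value, timestamp):
    """Count rows where `column` equals `value`, created before `timestamp`.

    Timestamps are assumed to share one sortable text format.
    """
    return sum(
        1 for row in rows
        if row[column] == value and row["created_at"] < timestamp
    )


def stars_before(watchers, repo_id, timestamp):
    """Number of stars a project had received before `timestamp`."""
    return count_before(watchers, "repo_id", repo_id, timestamp)


def forks_before(projects, repo_id, timestamp):
    """Number of forks of a project created before `timestamp`."""
    return count_before(projects, "forked_from", repo_id, timestamp)


def followers_before(followers, user_id, timestamp):
    """Number of followers a developer had before `timestamp`."""
    return count_before(followers, "user_id", user_id, timestamp)
```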