\section{Dataset}
\label{sec:dsrq}

In this study,
we leverage our previous dataset \textit{DupPR}~\cite{dup2018},
which %describes
contains the duplicate relations among pull requests, together with the profiles and review comments of those pull requests, from 26 popular OSS projects hosted on GitHub.
We also extend the dataset with complementary data, including code commits, check statuses of DevOps tools, and contribution histories of developers. %%%do you put the dataset somewhere?

% \subsection{Construction of \textit{DupPR}} %%%since this is in the existing work, there is no need to describe its construction process in such detail, you may describe what a record is in DupPR instead of the construction process.

\subsection{\textit{DupPR} basic dataset}

In our prior work~\cite{dup2018},
we built a unique dataset of more than 2,000 pairs of duplicate pull requests (called \textit{DupPR}~\cite{MSR2018-DupPR}) by analyzing the review comments from 26 popular OSS projects hosted on GitHub.
Each pair of duplicates in \textit{DupPR} is represented as a quadruple \texttt{<proj, pr1, pr2, idn\_cmt>}.
Item \texttt{proj} indicates the project (\eg \textit{rails/rails})
that the duplicate pull requests belong to.
Items \texttt{pr1} and \texttt{pr2} are the tracking numbers of the two pull requests.
Item \texttt{idn\_cmt} is a review comment on either \texttt{pr1} or \texttt{pr2}
in which a reviewer states the duplicate relation between \texttt{pr1} and \texttt{pr2}.
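
As an illustration only, one record of this form could be held in a small Python structure.
The class name \texttt{DupPair} and the sample values are hypothetical and do not come from the dataset;
only the four field names follow the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DupPair:
    """One DupPR record: a pair of duplicate pull requests."""
    proj: str     # project slug, e.g. "rails/rails"
    pr1: int      # tracking number of the first pull request
    pr2: int      # tracking number of the second pull request
    idn_cmt: str  # review comment stating the duplicate relation


# Hypothetical record; the numbers and the comment text are made up.
pair = DupPair("rails/rails", 101, 105, "Duplicate of #101")
print(pair.proj, pair.pr1, pair.pr2)
```
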
{\color{hltext}
The dataset is meant to contain only accidental duplicates,
that is, duplicates whose authors were not aware of the other pull request when creating their own.
To increase the accuracy of the dataset,
we recheck it and filter out the intentional duplicates
that were not detected before.
Specifically,
we omit duplicates from the dataset when they meet one of the following criteria:
\textit{i)} The authors' discussion on the associated issue reveals that the duplication was on purpose.
A representative comment indicating intentional duplication is \textit{``I saw your PR and there wasn't any activity or follow up in that from last 18 days''};
\textit{ii)} One author performed some action on the other pull request before submitting her/his own.
In addition to commenting,
actions such as assigning reviewers and adding labels are also taken into consideration;
and \textit{iii)} One author mentioned the other pull request immediately after submitting her/his own,
which suggests that the author might have already known about that pull request.
}

A pull request might be a duplicate of more than one pull request.
Therefore, in this study,
we organize a group of duplicate pull requests in a
tuple structure $\langle dup_1,\ dup_2,\ dup_3,\ \ldots,\ dup_n \rangle$
in which the items are sorted by their submission time.
% {\color{hltext}
% \hl{In order to prevent introducing extra complexity to our subsequent analyses},
% we also exclude the tuples in which not each two pull requests are accidentally duplicate
% (it is possible that a pull request is duplicate of two pull requests submitted by the same author).
% }
In total,
% 2049
we have {\color{hltext}1,751} tuples of duplicate pull requests,
and their distribution over different tuple sizes is shown in Table~\ref{tab:tuple_sta}.

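
The grouping of duplicate pairs into time-ordered tuples can be sketched as follows.
This is a minimal sketch, not the procedure actually used to build the dataset:
it assumes that duplicate relations are merged transitively (via a union-find structure),
and the function name \texttt{group\_duplicates} and the sample data are hypothetical.

```python
from collections import defaultdict


def group_duplicates(pairs, submitted_at):
    """Merge duplicate pairs into tuples ordered by submission time.

    pairs: iterable of (pr1, pr2) numbers within one project.
    submitted_at: dict mapping a pull request number to a sortable timestamp.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two groups

    groups = defaultdict(list)
    for pr in parent:
        groups[find(pr)].append(pr)
    return sorted(
        tuple(sorted(members, key=submitted_at.get))
        for members in groups.values()
    )


# Made-up pull request numbers and ISO dates (string order equals time order).
pairs = [(1, 2), (2, 3), (7, 9)]
submitted_at = {1: "2018-01-03", 2: "2018-01-01", 3: "2018-01-02",
                7: "2018-02-01", 9: "2018-02-05"}
print(group_duplicates(pairs, submitted_at))
```

Under the transitivity assumption, pull requests 1, 2, and 3 end up in one tuple of size three, and 7 and 9 in a tuple of size two.
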
\begin{table}[h]
{\color{hltext}
\centering
\caption{{\color{hltext}The distribution of tuples over different sizes}}
\begin{tabularx}{0.49\textwidth}{@{}l Y Y Y Y Y c @{}}
\toprule
\textbf{Size} & 2 & 3 & 4 & 5 & 6 & 7\\ \midrule
\textbf{Count} & 1,664 & 69 & 10 & 5 & 2 & 1\\
\bottomrule
\end{tabularx}
\label{tab:tuple_sta}
}
\end{table}

\subsection{Collecting complementary data}

\subsubsection{Patch detail}
The GitHub API (\texttt{/repos/:owner/:repo/pulls/:pull\_number/commits}) allows us to retrieve the commits of each pull request.
From the returned results,
we parse the \texttt{sha} of each commit and
request the API (\texttt{/repos/:owner/:repo/commits/:commit\_sha}) to return more detailed information about the commit,
including \texttt{author} and \texttt{author\_date}.
Moreover,
the API (\texttt{/repos/:owner/:repo/pulls/:pull\_number/files})
returns the files changed by a pull request,
from which we can parse the \texttt{filename} and \texttt{changes}
(lines of code added and deleted) of each changed file.

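
A minimal sketch of how the three endpoints and their responses might be handled is given below.
The helper names are hypothetical, the payloads are hand-written and heavily trimmed,
and we assume that the author name and date sit under \texttt{commit.author} in the single-commit response;
no network request is made.

```python
API_ROOT = "https://api.github.com"


def commits_url(owner, repo, pull_number):
    """Endpoint listing the commits on a pull request."""
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls/{pull_number}/commits"


def commit_url(owner, repo, commit_sha):
    """Endpoint returning the details of one commit."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits/{commit_sha}"


def files_url(owner, repo, pull_number):
    """Endpoint listing the files changed by a pull request."""
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls/{pull_number}/files"


def parse_shas(commits_payload):
    """Extract the sha of every commit in a commit-list response."""
    return [c["sha"] for c in commits_payload]


def parse_author(commit_payload):
    """Return (author name, author date) from a single-commit response."""
    author = commit_payload["commit"]["author"]
    return author["name"], author["date"]


def parse_files(files_payload):
    """Return (filename, changed lines) for every changed file."""
    return [(f["filename"], f["changes"]) for f in files_payload]


# Hand-written sample payloads, not real API output.
sample_commits = [{"sha": "abc123"}, {"sha": "def456"}]
sample_commit = {
    "sha": "abc123",
    "commit": {"author": {"name": "Alice", "date": "2018-01-01T10:00:00Z"}},
}
sample_files = [
    {"filename": "README.md", "additions": 3, "deletions": 1, "changes": 4},
]
print(parse_shas(sample_commits), parse_files(sample_files))
```
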
\subsubsection{Check statuses}
\label{cs}
Various DevOps tools are seamlessly integrated with and widely used on GitHub;
examples are Travis-CI~\cite{travis} for continuous integration and Code-Climate~\cite{climate} for static analysis.
When a pull request is submitted or updated,
a set of DevOps tools is automatically launched to check
whether the pull request can be safely merged into the codebase.
The GitHub API (\texttt{/repos/:owner/:repo/commits/:ref/status}) returns
the check statuses for a specific commit.
There are two levels of statuses in the returned results.
Because multiple DevOps tools can be used to check a commit,
each tool is associated with a check status,
which we call the \textit{context-level check status}.
For each context-level check status,
we can parse the \texttt{state} and \texttt{context} fields.
{\color{hltext}
The \texttt{state} of a check can be
\textit{success}, \textit{failure}, \textit{pending}, or \textit{error}.}
The state \textit{success} means that a check has passed,
while \textit{failure} indicates that the check has failed.
If the check is still running and no result has been returned,
its \texttt{state} is \textit{pending}.
{\color{hltext}
The state \textit{error} indicates that a check did not run successfully and produced an error.
Following the guidelines of prior work~\cite{bel16oop,souza2017sentiment},
we treat the state \textit{error} the same as \textit{failure}, since both are opposed to \textit{success}.
}
The \texttt{context} indicates which tool is used in a specific check.
{\color{hltext}
According to the bot taxonomy defined in a prior study~\cite{Wessel2018bot},
the checking tools can be classified into three categories:
CI (report continuous integration test results, \eg \texttt{continuous-integration/travis-ci}),
CLA (ensure license agreement signing, \eg \texttt{cla/google}),
and CR (review source code, \eg \texttt{coverage/coveralls} and \texttt{codeclimate}).
Based on all context-level check statuses of a commit,
the API also returns an overall check status of that commit~\cite{statusrule},
which we call the \textit{commit-level check status}.
The \texttt{state} of a commit-level check can be one of
\textit{success}, \textit{failure}, and \textit{pending}.
}

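
The relation between the two levels of check status can be sketched as follows.
The combination rule encoded here is our reading of the rule GitHub documents for the combined status
(any failure or error yields failure; otherwise any pending context, or no status at all, yields pending; otherwise success),
and the prefix-based tool classification is a hypothetical lookup derived only from the examples named above.

```python
def normalize(state):
    """Treat 'error' the same as 'failure', as described in the text."""
    return "failure" if state == "error" else state


def commit_level_state(context_states):
    """Combine context-level states into one commit-level state.

    Assumed rule: any failure/error -> failure; otherwise any pending
    (or no statuses at all) -> pending; otherwise success.
    """
    states = [normalize(s) for s in context_states]
    if "failure" in states:
        return "failure"
    if not states or "pending" in states:
        return "pending"
    return "success"


def classify_context(context):
    """Hypothetical prefix lookup based only on the examples in the text."""
    if context.startswith("continuous-integration/"):
        return "CI"
    if context.startswith("cla/"):
        return "CLA"
    if context.startswith("coverage/") or context.startswith("codeclimate"):
        return "CR"
    return "unknown"


print(commit_level_state(["success", "error", "pending"]))
print(classify_context("continuous-integration/travis-ci"))
```
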
{\color{hltext}
\subsubsection{Timeline events}

The GitHub API (\texttt{/repos/:owner/:repo/issues/:issue\_number/events})
returns the events triggered by activities (\eg assigning a label and posting a comment) in issues and pull requests.
We request this API for each pull request.
From the returned result,
we can parse who (\texttt{actor}) triggered
which event (\texttt{event}) at what time (\texttt{created\_at}).
For \texttt{closed} events,
we can also parse
which commit (\texttt{commit\_id}, i.e., the commit SHA) closed the pull request.
Event data are mainly used for rechecking the dataset
and determining pull request acceptance.
}

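
Parsing the returned events can be sketched as follows.
The helper names and the sample events are hypothetical;
we assume that the API reports the closing event under the name \texttt{closed}
and that \texttt{actor} is an object carrying a \texttt{login} field, which may be missing for deleted accounts.

```python
def parse_events(events):
    """Return (actor login, event name, created_at) for every event."""
    return [
        ((e.get("actor") or {}).get("login"), e["event"], e["created_at"])
        for e in events
    ]


def closing_commit(events):
    """Return the commit_id of the most recent 'closed' event, or None."""
    for e in reversed(events):
        if e["event"] == "closed":
            return e.get("commit_id")
    return None


# Hand-written sample events, not real API output.
sample_events = [
    {"actor": {"login": "alice"}, "event": "labeled",
     "created_at": "2018-01-01T10:00:00Z", "commit_id": None},
    {"actor": {"login": "bob"}, "event": "closed",
     "created_at": "2018-01-02T11:00:00Z", "commit_id": "abc123"},
]
print(parse_events(sample_events))
print(closing_commit(sample_events))
```
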
\subsubsection{Contribution histories}
Rather than requesting the GitHub API,
we use the GHTorrent dataset~\cite{Gousios:2014},
which makes it easier and more efficient to obtain the entire contribution history
of a specific developer on GitHub.
GHTorrent stores its data in several tables, and
we mainly use
\texttt{pull\_requests} (\texttt{PR}), \texttt{issues},
\texttt{pull\_request\_history} (\texttt{PRH}),
\texttt{pull\_request\_comments} (\texttt{PRC}),
and \texttt{issue\_comments} (\texttt{ISC}).
From tables \texttt{PR}, \texttt{PRH}, and \texttt{PRC},
we can parse
who (\texttt{PRH.actor\_id}) submitted
which pull request (\texttt{PR.pullreq\_id})
to which project (\texttt{PR.base\_repo\_id})
at what time (\texttt{PRH.created\_at})
and who (\texttt{PRC.user\_id}) commented on that pull request
at what time (\texttt{PRC.created\_at}).
Similarly,
from tables \texttt{issues} and \texttt{ISC},
we can parse
who (\texttt{issues.reporter\_id}) reported
which issue (\texttt{issues.issue\_id})
to which project (\texttt{issues.repo\_id})
at what time (\texttt{issues.created\_at})
and who (\texttt{ISC.user\_id}) commented on that issue at what time (\texttt{ISC.created\_at}).
Based on this information,
we can acquire the whole contribution history of a specific developer.

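
As a sketch of how a developer's submission history might be queried, the example below builds a tiny in-memory SQLite database with two tables shaped like \texttt{PR} and \texttt{PRH}.
The join key \texttt{pull\_request\_id}, the \texttt{action} column, and its value \texttt{opened} reflect our understanding of the GHTorrent schema and should be verified against the actual dump;
all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pull_requests (
    id INTEGER, base_repo_id INTEGER, pullreq_id INTEGER);
CREATE TABLE pull_request_history (
    pull_request_id INTEGER, actor_id INTEGER, action TEXT, created_at TEXT);
-- Made-up rows for illustration only.
INSERT INTO pull_requests VALUES (1, 10, 101), (2, 10, 105);
INSERT INTO pull_request_history VALUES
    (1, 42, 'opened', '2018-01-01 10:00:00'),
    (1, 7,  'closed', '2018-01-02 12:00:00'),
    (2, 42, 'opened', '2018-01-03 09:00:00');
""")

# Who submitted which pull request to which project at what time.
SUBMISSIONS = """
SELECT pr.base_repo_id, pr.pullreq_id, prh.created_at
FROM pull_request_history AS prh
JOIN pull_requests AS pr ON pr.id = prh.pull_request_id
WHERE prh.actor_id = ? AND prh.action = 'opened'
ORDER BY prh.created_at
"""


def submissions_of(connection, developer_id):
    """Return (project, pull request number, time) rows for one developer."""
    return connection.execute(SUBMISSIONS, (developer_id,)).fetchall()


print(submissions_of(conn, 42))
```

Comments and issues could be retrieved analogously from \texttt{PRC}, \texttt{issues}, and \texttt{ISC}.
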
{\color{hltext}
\subsubsection{Popularity and reputation}

GHTorrent also provides tables relating to project popularity and developer reputation.
From table \texttt{watchers},
we can parse who (\texttt{user\_id}) starred which project (\texttt{repo\_id}) at what time (\texttt{created\_at}).
From table \texttt{projects},
we can parse which project (\texttt{id}) was forked from which project (\texttt{forked\_from}) at what time (\texttt{created\_at}).
From table \texttt{followers},
we can parse who (\texttt{follower\_id}) started to follow whom (\texttt{user\_id})
at what time (\texttt{created\_at}).

}
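
Because each of these records carries a timestamp, a popularity measure can in principle be computed as of a given moment.
The sketch below counts the stars a project had received before a cutoff time;
the function name \texttt{stars\_before}, the row layout, and the sample rows are hypothetical, and the choice of cutoff is an assumption made for illustration.

```python
def stars_before(watchers, repo_id, cutoff):
    """Count the stars a project had received strictly before `cutoff`.

    watchers: iterable of (repo_id, user_id, created_at) rows, with
    created_at as an ISO-like string so string order equals time order.
    """
    return sum(1 for r, _user, ts in watchers if r == repo_id and ts < cutoff)


# Hypothetical rows, not taken from GHTorrent.
watchers = [
    (10, 1, "2017-05-01 08:00:00"),
    (10, 2, "2018-02-01 09:30:00"),
    (11, 1, "2017-06-15 12:00:00"),
]
print(stars_before(watchers, 10, "2018-01-01 00:00:00"))
```
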