%!TEX root = main.tex
\section{Description of dataset}
% Overview of the data tables and a description of the fields in each table
We store the \textit{DupPR} dataset in a MySQL database
and make the script files available online on GitHub.
Figure~\ref{fig:schemas_h} illustrates the schema of \textit{DupPR}.
The dataset consists of four tables,
whose fields are defined as follows.
% \begin{figure}[!htbp]
% \centering
% \includegraphics[width=0.4\textwidth]{figs/schemas_v.pdf}
% \caption{The Schemas of our dataset}
% \label{fig:schemas_v}
% \end{figure}
\begin{figure}[!htbp]
\centering
\includegraphics[width=0.44\textwidth]{figs/schemas_h.pdf}
\caption{The schema of the \textit{DupPR} dataset}
\label{fig:schemas_h}
\end{figure}
\begin{itemize}[leftmargin=0em,itemindent=2em]
\item Table \texttt{Project} stores the basic information of the studied projects.
Field \texttt{user\_name} is the name of the user who owns the project on GitHub,
and field \texttt{repo\_name} is the name of the project.
These two fields, together with the domain name of GitHub,
can be used to compose the resource locator (URL) of the project on GitHub.
The other fields in table \texttt{Project} record statistical characteristics of a project;
for example, \texttt{fork\_count} is the number of forks.
\item For each project,
all the PRs belonging to it are stored in table \texttt{Pull-request}.
Field \texttt{prj\_id} is the value of \texttt{id} of the project,
and \texttt{pr\_num}, \texttt{title}, and \texttt{description} represent the number label (generated by GitHub),
the title, and the description of a PR, respectively.
Moreover, field \texttt{pr\_num} uniquely identifies a PR within the address space of a project on GitHub.
Fields \texttt{author} and \texttt{created\_at} indicate that a PR was submitted by the GitHub user named \texttt{author}
at time \texttt{created\_at}.
\item For a pull-request in table \texttt{Pull-request},
the comments on it are stored in table \texttt{Comment}.
In table \texttt{Comment},
field \texttt{pr\_id} is the value of \texttt{id} of the pull-request.
The text content, the creation time, and the author of a comment are represented by fields \texttt{content}, \texttt{created\_at},
and \texttt{author}, respectively.
\item Table \texttt{Duplicate} contains all the pairs of duplicate PRs.
Field \texttt{prj\_id} is the value of \texttt{id} of the project to which a pair of duplicate PRs belongs.
For a pair of duplicate PRs,
field \texttt{mst\_pr} is the number of the PR that was submitted earlier,
and field \texttt{dup\_pr} is the number of the PR that was submitted later.
Field \texttt{idn\_cmt} is the first indicative comment
% belonging to \texttt{mst\_pr} or \texttt{dup\_pr},
that points out the duplicate relation between \texttt{mst\_pr} and \texttt{dup\_pr}.
\end{itemize}
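As a hedged illustration of how these fields can be combined,
the templates below sketch the resource locators of a project and of one of its PRs.
The angle-bracket placeholders stand for field values,
and the \texttt{/pull/} path segment follows GitHub's usual URL convention;
it is an assumption here rather than part of the \textit{DupPR} schema.
\begin{verbatim}
https://github.com/<user_name>/<repo_name>
https://github.com/<user_name>/<repo_name>/pull/<pr_num>
\end{verbatim}
For a row of table \texttt{Duplicate},
substituting the value of \texttt{mst\_pr} or \texttt{dup\_pr} for the PR number,
together with the \texttt{user\_name} and \texttt{repo\_name} of the project referenced by \texttt{prj\_id},
would give the locators of the two PRs in a duplicate pair.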