
%!TEX root = main.tex

\section{Description of dataset}

% Overview of the tables and description of the fields in each table
We store the \textit{DupPR} dataset in a MySQL database
and make the script files available online on GitHub.
Figure~\ref{fig:schemas_h} illustrates the schema of \textit{DupPR}.
\textit{DupPR} consists of four tables,
whose fields are defined as follows.


% \begin{figure}[!htbp]
%  	\centering
%  	\includegraphics[width=0.4\textwidth]{figs/schemas_v.pdf}
%  	\caption{The Schemas of our dataset}
% 	\label{fig:schemas_v}
% \end{figure}


\begin{figure}[!htbp]
 	\centering
 	\includegraphics[width=0.44\textwidth]{figs/schemas_h.pdf}
 	\caption{The schema of the \textit{DupPR} dataset}
	\label{fig:schemas_h}
\end{figure}

\begin{itemize}[leftmargin=0em,itemindent=2em]
	 \item Table \texttt{Project} stores basic information about the studied projects.
	Field \texttt{user\_name} is the name of the user who owns the project on GitHub,
	and field \texttt{repo\_name} is the name of the project.
	These two fields, together with the domain name of GitHub,
	can be used to compose the URL of the project on GitHub.
	The other fields in table \texttt{Project} record statistical characteristics of a project;
	for example, \texttt{fork\_count} is the number of forks.

	\item For each project,
	all the PRs belonging to it are stored in table \texttt{Pull-request}.
	Field \texttt{prj\_id} is the value of \texttt{id} of the project,
	and \texttt{pr\_num}, \texttt{title}, and \texttt{description} represent the number label (generated by GitHub),
	the title, and the description of a PR, respectively.
	Moreover, field \texttt{pr\_num} uniquely locates a PR within the address space of a project on GitHub.
	Fields \texttt{author} and \texttt{created\_at} indicate that a PR was submitted by the GitHub user named \texttt{author}
	at the time \texttt{created\_at}.

	\item For each PR in table \texttt{Pull-request},
	the comments on it are stored in table \texttt{Comment}.
	In table \texttt{Comment},
	field \texttt{pr\_id} is the value of \texttt{id} of the PR.
	The text content, the creation time, and the author of a comment are represented by fields \texttt{content}, \texttt{created\_at},
	and \texttt{author}, respectively.

	\item Table \texttt{Duplicate} contains all the pairs of duplicate PRs.
	Field \texttt{prj\_id} is the value of \texttt{id} of the project that a pair of duplicate PRs belongs to.
	For a pair of duplicate PRs,
	field \texttt{mst\_pr} is the number of the PR that was submitted earlier
	and field \texttt{dup\_pr} is the number of the PR that was submitted later.
	Field \texttt{idn\_cmt} is the first indicative comment
	% belonging to \texttt{mst\_pr} or \texttt{dup\_pr},
	that points out the duplicate relation between \texttt{mst\_pr} and \texttt{dup\_pr}.

\end{itemize}
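
The relations among the tables described above can be sketched as a query.
The following is a minimal sketch rather than part of the released scripts:
it assumes that the table and field names in the MySQL database match the names used in the text,
and that \texttt{mst\_pr} and \texttt{dup\_pr} hold \texttt{pr\_num} values of PRs in the project identified by \texttt{prj\_id}.
Under these assumptions, it lists each pair of duplicate PRs with the titles of both PRs and the URL of the project.

\begin{verbatim}
-- Sketch only: table and field names follow
-- the text and may differ from the scripts.
SELECT
  CONCAT('https://github.com/',
         p.user_name, '/',
         p.repo_name) AS project_url,
  d.mst_pr, m.title AS mst_title,
  d.dup_pr, u.title AS dup_title
FROM `Duplicate` AS d
JOIN `Project` AS p
  ON p.id = d.prj_id
JOIN `Pull-request` AS m  -- earlier PR
  ON m.prj_id = d.prj_id
 AND m.pr_num = d.mst_pr
JOIN `Pull-request` AS u  -- later PR
  ON u.prj_id = d.prj_id
 AND u.pr_num = d.dup_pr;
\end{verbatim}

Table \texttt{Pull-request} is joined twice because each row of \texttt{Duplicate} refers to two PRs of the same project.
The backticks are needed in MySQL because the table name contains a hyphen.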