through again

starlee 2018-03-14 10:55:49 +08:00
parent b605b290f8
commit f3053b5fe7
6 changed files with 9 additions and 9 deletions


@@ -12,8 +12,8 @@ which increase the extra cost of project maintenance.
%for both reviewers and contributors.
To facilitate the further studies to better understand and solve the issues
introduced by duplicate PRs,
-we construct a large dataset of historical duplicate extracted from
-PRs 26 popular open source projects in GitHub by using a semi-automatic approach.
+we construct a large dataset of historical duplicate PRs extracted from
+26 popular open source projects in GitHub by using a semi-automatic approach.
Furthermore, we present some preliminary applications
to illustrate how further researches can be conducted based on this dataset.


@@ -4,7 +4,7 @@
% The PR-based contribution workflow in GitHub
In GiHub,
-pull-based mechanism~\cite{GZ14,Gousios:2014,yu2015wait}
+the pull-based mechanism~\cite{GZ14,Gousios:2014,yu2015wait}
lowers the contribution entry for community developers and
prompts the development and evolution of numerous open source software projects.
Any contributor can fork (\ie clone) a repository
@@ -61,7 +61,7 @@ we constructed a large dataset of historical duplicate PRs
(called \textit{DupPR})
extracted from 26 open source projects in GitHub.
Each pair of duplicate PRs in \textit{DupPR}
-has been manually verified after a automatic identification process,
+has been manually verified after an automatic identification process,
which would guarantee the quality of this dataset.
%The dataset and the source code used to recreate it is available
%online.~\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}


@@ -10,7 +10,7 @@ involving 12 programming languages and various application domains
(\eg web-application framework, database and scientific computing library).
Table~\ref{tab:basic_dataset} presents some of statistical characteristics about
the project scale and popularity, \eg the number of PRs, contributors and forks,
-which shows that they have attracted plenty of attentions from the community.
+which show that they have attracted plenty of attentions from the community.
Also, we can assure that the studied projects have full-fledged and heavy usage of PR mechanism
(minimum number of PRs is 5,050).
More details can be found in the released dataset.
@@ -51,7 +51,7 @@ More details can be found in the released dataset.
Unlike Stack Overflow,
which indicates duplicate posts with a signal ``[duplicate]'' at the end of question titles,
-GitHub provides no explicit and unified mechanism to indicate duplicates PR.
+GitHub provides no explicit and unified mechanism to indicate duplicate PRs.
Although reviewers are encouraged to use the pre-defined
reply template~\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}}
when they intend to point out a PR is duplicate to another one,


@@ -41,7 +41,7 @@ and the fields in these tables are defined as follows.
and \texttt{pr\_num}, \texttt{title} and \texttt{description} represent the number label (generated by GitHub),
the title and the description of a PR respectively.
Moreover, field \texttt{pr\_num} can be used to uniquely locate a PR in the addressing space of a project in GitHub.
-Fields \texttt{author} and \texttt{created\_at} means a PR is submitted by the GitHub user named \texttt{author}
+Fields \texttt{author} and \texttt{created\_at} mean a PR is submitted by the GitHub user named \texttt{author}
at the time of \texttt{created\_at}.
\item For a pull-request in Table \texttt{Pull-request},


@@ -86,7 +86,7 @@ The dataset \textit{DupPR} is constructed through a rigorous process
which involves careful manual verifying.
Thus, it can act as a ground truth to train and evaluate intelligent models (\eg classification model).
%Actually,
-Here, we conducte a preliminary experiments to automatically identify duplicate PRs.
+Here, we conduct a preliminary experiment to automatically identify duplicate PRs.
%we have conducted experiments to automatically identify duplicate PRs at submission time.
By employing natural language processing and calculating the overlap of changes,
we measure the similarity between two PRs, and then


@@ -25,7 +25,7 @@ this dataset still has several limitations.
The studied projects are only a relatively small proportion of all the projects hosted in GitHub.
We plan to enrich the dataset by taking more projects into consideration.
In addition,
-identification rules are extracted base on sampled comments
+identification rules are extracted based on sampled comments
and therefore the set of rules might be incomplete
which would result in false negatives in the dataset.
In future work,