through again
parent b605b290f8
commit f3053b5fe7
@@ -12,8 +12,8 @@ which increase the extra cost of project maintenance.
 %for both reviewers and contributors.
 To facilitate the further studies to better understand and solve the issues
 introduced by duplicate PRs,
-we construct a large dataset of historical duplicate extracted from
-PRs 26 popular open source projects in GitHub by using a semi-automatic approach.
+we construct a large dataset of historical duplicate PRs extracted from
+26 popular open source projects in GitHub by using a semi-automatic approach.
 Furthermore, we present some preliminary applications
 to illustrate how further researches can be conducted based on this dataset.
 

@@ -4,7 +4,7 @@
 
 % PR-based contribution workflow in GitHub
 In GiHub,
-pull-based mechanism~\cite{GZ14,Gousios:2014,yu2015wait}
+the pull-based mechanism~\cite{GZ14,Gousios:2014,yu2015wait}
 lowers the contribution entry for community developers and
 prompts the development and evolution of numerous open source software projects.
 Any contributor can fork (\ie clone) a repository
@@ -61,7 +61,7 @@ we constructed a large dataset of historical duplicate PRs
 (called \textit{DupPR})
 extracted from 26 open source projects in GitHub.
 Each pair of duplicate PRs in \textit{DupPR}
-has been manually verified after a automatic identification process,
+has been manually verified after an automatic identification process,
 which would guarantee the quality of this dataset.
 %The dataset and the source code used to recreate it is available
 %online.~\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}

@@ -10,7 +10,7 @@ involving 12 programming languages and various application domains
 (\eg web-application framework, database and scientific computing library).
 Table~\ref{tab:basic_dataset} presents some of statistical characteristics about
 the project scale and popularity, \eg the number of PRs, contributors and forks,
-which shows that they have attracted plenty of attentions from the community.
+which show that they have attracted plenty of attentions from the community.
 Also, we can assure that the studied projects have full-fledged and heavy usage of PR mechanism
 (minimum number of PRs is 5,050).
 More details can be found in the released dataset.
@@ -51,7 +51,7 @@ More details can be found in the released dataset.
 
 Unlike Stack Overflow,
 which indicates duplicate posts with a signal ``[duplicate]'' at the end of question titles,
-GitHub provides no explicit and unified mechanism to indicate duplicates PR.
+GitHub provides no explicit and unified mechanism to indicate duplicate PRs.
 Although reviewers are encouraged to use the pre-defined
 reply template~\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}}
 when they intend to point out a PR is duplicate to another one,

@@ -41,7 +41,7 @@ and the fields in these tables are defined as follows.
 and \texttt{pr\_num}, \texttt{title} and \texttt{description} represent the number label (generated by GitHub),
 the title and the description of a PR respectively.
 Moreover, field \texttt{pr\_num} can be used to uniquely locate a PR in the addressing space of a project in GitHub.
-Fields \texttt{author} and \texttt{created\_at} means a PR is submitted by the GitHub user named \texttt{author}
+Fields \texttt{author} and \texttt{created\_at} mean a PR is submitted by the GitHub user named \texttt{author}
 at the time of \texttt{created\_at}.
 
 \item For a pull-request in Table \texttt{Pull-request},

@@ -86,7 +86,7 @@ The dataset \textit{DupPR} is constructed through a rigorous process
 which involves careful manual verifying.
 Thus, it can act as a ground truth to train and evaluate intelligent models (\eg classification model).
 %Actually,
-Here, we conducte a preliminary experiments to automatically identify duplicate PRs.
+Here, we conduct a preliminary experiment to automatically identify duplicate PRs.
 %we have conducted experiments to automatically identify duplicate PRs at submission time.
 By employing natural language processing and calculating the overlap of changes,
 we measure the similarity between two PRs, and then

@@ -25,7 +25,7 @@ this dataset still has several limitations.
 The studied projects are only a relatively small proportion of all the projects hosted in GitHub.
 We plan to enrich the dataset by taking more projects into consideration.
 In addition,
-identification rules are extracted base on sampled comments
+identification rules are extracted based on sampled comments
 and therefore the set of rules might be incomplete
 which would result in false negatives in the dataset.
 In future work,