done all
@@ -2,7 +2,7 @@
\begin{abstract}
In GitHub,
the pull-based development model enables community contributors to collaborate in a more efficient way.
However, the distributed and parallel characteristics of this model
% carry contributors a potential risk of submitting duplicate pull-requests.
pose a potential risk for developers to submit duplicate pull-requests (PRs),
@@ -66,8 +66,8 @@ which would guarantee the quality of this dataset.
%The dataset and the source code used to recreate it is available
%online.~\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
%Based on this dataset, the following interesting research can be feasible.
We make the dataset and the source code available online,\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
in the hope that it will foster more interest in the following studies.
%which enables the researchers
\begin{itemize}
2-methd.tex
@@ -50,7 +50,7 @@ More details can be found in the released dataset.
% when they come across duplicate PRs.
Unlike Stack Overflow,
which indicates duplicate posts with a signal ``[duplicate]'' at the end of question titles,
GitHub provides no explicit and unified mechanism to indicate duplicate PRs.
Although reviewers are encouraged to use the pre-defined
reply template~\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}}
@@ -123,11 +123,11 @@ In the above example comments,
are all the typical indicative phrases.
Together with PR references,
these indicative phrases can be used to compose the identification rules.
An identification rule can be implemented as a regular expression
which is applied to match comment text to identify duplicate relations.
The following items are some simplified rules,
and the complete set of our rules can be found
online.\footnote{\url{https://github.com/whystar/MSR2018-DupPR/blob/master/code/rules.py}}
\begin{itemize}
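Below is a minimal Python sketch, with hypothetical patterns, of how such regular-expression rules can be matched against review comments; the authors' complete, real rule set is the one released in \texttt{code/rules.py}.
\begin{verbatim}
import re

# Simplified, hypothetical rules: each one pairs an indicative phrase with a
# PR reference.  The released code/rules.py contains the complete rule set.
RULES = [
    re.compile(r"duplicate of\s+#(\d+)", re.IGNORECASE),
    re.compile(r"(?:already|previously)\s+(?:proposed|submitted|fixed)\s+in\s+#(\d+)",
               re.IGNORECASE),
    re.compile(r"clos(?:e|ed|ing)\s+in\s+favou?r\s+of\s+#(\d+)", re.IGNORECASE),
]

def find_duplicate_reference(comment_text):
    """Return the referenced PR number if the comment signals a duplicate, else None."""
    for rule in RULES:
        match = rule.search(comment_text)
        if match:
            return int(match.group(1))
    return None

# Example: a reviewer comment identifying an earlier PR.
print(find_duplicate_reference("Thanks, but this is a duplicate of #1234."))  # -> 1234
\end{verbatim}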
4-applct.tex
@@ -2,79 +2,80 @@
\section{Applications}

To illustrate the potential research that can be conducted based on the dataset \textit{DupPR}
(maybe sometimes together with GHTorrent~\cite{Gousios2012GHTorrent} and the GitHub API),
we present some of the preliminary applications.

\subsection{Detection latency \& redundant effort}
First,
we have explored the \textit{detection latency} of duplicates.
In this paper,
detection latency is used to measure how long it takes to detect the duplicate relation between two PRs.
It is defined as the time period
from the submission time of a new PR
to the time when the duplicate relation between it and a historical PR is identified.
For each item in table \texttt{Duplicate},
the property \texttt{created\_at} of \texttt{dup\_pr} in table \texttt{Pull-request} is used as the submission time,
and the property \texttt{created\_at} of \texttt{idn\_cmt} in table \texttt{Comment} is used as the identification time.
Figure~\ref{fig:delay_time_bar} shows the statistical distribution of the detection latency based on our dataset.
%We calculate the detection latency of all the duplicates in our dataset,
%and the statistic result is shown in Figure~\ref{fig:delay_time_bar}.
Nearly 21\% (486) of the duplicates are detected after a relatively long latency (more than one week).
Those PRs have probably already consumed a lot of unnecessary manpower
and computational resources (\eg continuous integration~\cite{yu2015wait,ci2015}).
% 1,474 (63\%) duplicates are identified in less than one day,
% while 865 (37.0\%) duplicates are detected after a longer latency of more than one day.
% \hl{This .....}

In addition,
we focus on how much redundant review effort has been spent %during the detection latency
by calculating the number of different reviewers and comments
that are involved in the evolution process of duplicate PRs.
According to our statistics,
there are on average 2.5 reviewers participating in the redundant review discussions
and 5.2 review comments are generated before the duplicate relation is identified.
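As a minimal sketch of the latency computation described above, assuming the three tables have been loaded into Python dictionaries; the field names follow the description in the text, and the ids and timestamps below are invented for illustration only:
\begin{verbatim}
from datetime import datetime

def detection_latency_hours(duplicate, pull_requests, comments):
    """Latency for one row of table Duplicate: from the submission of dup_pr
    to the identifying comment idn_cmt, in hours."""
    submitted = pull_requests[duplicate["dup_pr"]]["created_at"]
    identified = comments[duplicate["idn_cmt"]]["created_at"]
    return (identified - submitted).total_seconds() / 3600.0

# Toy example with invented ids and timestamps:
prs = {101: {"created_at": datetime(2016, 3, 1, 9, 0)}}
cmts = {9001: {"created_at": datetime(2016, 3, 9, 15, 0)}}
dup = {"dup_pr": 101, "idn_cmt": 9001}
print(detection_latency_hours(dup, prs, cmts))  # 198.0 hours, i.e. more than one week
\end{verbatim}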
\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{figs/delay_time_bar.png}
\caption{Distribution of detection latency}
\label{fig:delay_time_bar}
%\vspace{-0.2cm}
\end{figure}
% commented out on 2018-01-25
% \subsection{Redundant effort}

% add a statistical analysis of the rejected and the chosen PRs (e.g., number of changed files, lines of code, and the submitter's status)
\subsection{Preference of choice}
For each pair of duplicate PRs,
reviewers have to make a choice between them
or, in rare cases, make a combination.
We have tried to figure out the reasons why integrators would prefer one pull-request to the other one.
In brief, a winning PR shows some of the following prominent indicators:
%and recognized some prominent indicators of preferred pull-requests:
(a)~correct implementation;
(b)~early submission (\ie first-come first-merge);
(c)~better implementation (\eg fewer changed lines of code or better performance);
%%%% add a significance test here
(d)~providing tests for the changed code;
and (e)~submission by new contributors (reviewers prefer new contributors to encourage them to keep contributing).
Obviously,
there are other factors that can affect reviewers' preference of choice,
and we will conduct further research on this topic and analyze the influences of these factors.
We would also investigate the effectiveness of the current practices of choice.
For example, we are studying
whether the preference for new contributors would increase the probability of introducing potential bugs,
and whether it is necessary to dynamically adjust the strategy of choice
according to the development status of a project.

% discuss test and src changes separately
% the role of users
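As a hypothetical illustration of the planned significance analysis (assuming SciPy; the feature and the values below are invented), one could compare a single feature of the chosen and the rejected PR in each duplicate pair with a paired non-parametric test:
\begin{verbatim}
from scipy.stats import wilcoxon

# 'pairs' holds (chosen_pr_changed_lines, rejected_pr_changed_lines) per duplicate pair.
def smaller_change_preferred(pairs, alpha=0.05):
    chosen, rejected = zip(*pairs)
    stat, p_value = wilcoxon(chosen, rejected)  # paired, non-parametric test
    return p_value < alpha, p_value

pairs = [(120, 340), (45, 60), (200, 180), (30, 75), (88, 140)]  # toy numbers
significant, p = smaller_change_preferred(pairs)
print(f"p = {p:.3f}, significant difference: {significant}")
\end{verbatim}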
@@ -82,25 +83,30 @@ according to the development status of a project.
% third point: training and evaluation (ground truth)
\subsection{Training \& evaluating models}
The dataset \textit{DupPR} is constructed through a rigorous process
which involves careful manual verification.
Thus, it can act as a ground truth to train and evaluate intelligent models (\eg classification models).
%Actually,
Here, we conduct a preliminary experiment to automatically identify duplicate PRs.
%we have conducted experiments to automatically identify duplicate PRs at submission time.
By employing natural language processing and calculating the overlap of changes,
we measure the similarity between two PRs, and then
return a candidate list of the top-\textit{k} historical PRs that are most similar to the submitted PR.
We use half of \textit{DupPR} to train an automatic detection model
and use the rest to evaluate its performance.
Figure~\ref{fig:detect} shows the identification results measured by \textit{recall-rate@k},
which can achieve nearly 70\% when the size of the candidate list is set to 20.
%For a pair of duplicates \texttt{<mst\_pr, dup\_pr>},
%the model is effective if \texttt{mst\_pr} appears in the list of candidate duplicates of \texttt{dup\_pr}.
%To elaborate the performance of the model,
%Figure~\ref{fig:detect} shows the \textit{recall-rate@k} of the identification results. %~\cite{Runeson2007}
%When the size of candidate list is set to be 20,
%the automatic detection model is effective for about 70\% duplicate PRs in the test dataset.
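A rough Python sketch of this kind of detection pipeline (assuming scikit-learn, and using textual similarity only; the change-overlap features mentioned above and the authors' actual model are not reproduced here):
\begin{verbatim}
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_candidates(new_pr_text, history_texts, k=20):
    """Indices of the k historical PRs whose text is most similar to the new PR."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(history_texts + [new_pr_text])
    query = matrix[len(history_texts)]              # row of the newly submitted PR
    sims = cosine_similarity(query, matrix[:len(history_texts)]).ravel()
    return sims.argsort()[::-1][:k]

def recall_rate_at_k(duplicate_pairs, pr_texts, k=20):
    """duplicate_pairs: (mst_pr_id, dup_pr_id) tuples; pr_texts: id -> title/description.
    A hit is counted when mst_pr appears among the top-k candidates of dup_pr.
    (Simplification: the whole corpus except dup_pr is treated as history.)"""
    pairs = list(duplicate_pairs)
    hits = 0
    for mst_id, dup_id in pairs:
        history_ids = [pid for pid in pr_texts if pid != dup_id]
        candidates = top_k_candidates(pr_texts[dup_id],
                                      [pr_texts[i] for i in history_ids], k)
        hits += any(history_ids[i] == mst_id for i in candidates)
    return hits / len(pairs)
\end{verbatim}
The resulting hit ratio corresponds to the \textit{recall-rate@k} reported in Figure~\ref{fig:detect}.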
\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{figs/detect.png}
\caption{Performance of the automatic detection model}
\label{fig:detect}
\vspace{-0.2cm}
\end{figure}
5-cons.tex
@@ -1,21 +1,21 @@
%!TEX root = main.tex

\section{Conclusion}
The distributed and parallel characteristics of the pull-based development model
on one hand enable community users to collaborate in a more efficient and effective way,
but on the other hand pose a potential risk for contributors to submit duplicate PRs.

In this paper,
we present a large dataset containing 2,323 pairs of duplicate PRs,
collected from 26 popular open source projects hosted in GitHub.
The dataset includes duplicate relations between PRs,
the meta-data of PRs and reviews (\eg creation time, text content and author),
and the basic information of the studied projects.

The dataset allows us to conduct empirical studies to
understand the outcomes and issues of duplicates,
explore the underlying causes and the corresponding prevention strategies,
and analyze the practices and challenges of integrators and contributors in dealing with duplicates.
Moreover,
this dataset enables us to train and evaluate automatic models that can
detect duplicate historical PRs for a newly submitted PR.
@@ -24,15 +24,15 @@ However,
this dataset still has several limitations.
The studied projects are only a relatively small proportion of all the projects hosted in GitHub.
We plan to enrich the dataset by taking more projects into consideration.
In addition,
the identification rules are extracted based on sampled comments,
and therefore the set of rules might be incomplete,
which would result in false negatives in the dataset.
In future work,
we would like to continually improve the identification method.
In the meantime,
by sharing both the dataset and guidelines for recreation,
we intend to encourage other researchers to validate and extend the dataset.

% By sharing both the dataset and guidelines for recreation,
% we intend to encourage other researchers to validate and extend the dataset,
ref.bib
@@ -121,3 +121,11 @@
  pages={367--371},
  year={2015}
}

@inproceedings{ci2015,
  title={Quality and productivity outcomes relating to continuous integration in GitHub},
  author={Vasilescu, Bogdan and Yu, Yue and Wang, Huaimin and Devanbu, Premkumar and Filkov, Vladimir},
  booktitle={Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, FSE},
  pages={805--816},
  year={2015}
}