This commit is contained in:
Fisher Yu 2018-03-11 23:39:36 +08:00
parent dd3f49fe06
commit b605b290f8
6 changed files with 91 additions and 77 deletions


@ -2,7 +2,7 @@
\begin{abstract}
In GitHub,
the pull-based development model enables community contributors to collaborate in a more efficient way.
However, the distributed and parallel characteristics of this model
% carry contributors a potential risk of submitting duplicate pull-requests.
pose a potential risk of developers submitting duplicate pull-requests (PRs),


@ -66,8 +66,8 @@ which would guarantee the quality of this dataset.
%The dataset and the source code used to recreate it is available
%online.~\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
%Based on this dataset, the following interesting research can be feasible.
We make the dataset and the source code available online,\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
in the hope that it will foster more interest in the following studies.
%which enables the researchers
\begin{itemize}


@ -50,7 +50,7 @@ More details can be found in the released dataset.
% when they come across duplicate PRs.
Unlike Stack Overflow,
which indicates duplicate posts with a ``[duplicate]'' signal at the end of question titles,
GitHub provides no explicit and unified mechanism to indicate duplicate PRs.
Although reviewers are encouraged to use the pre-defined
reply template~\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}}
@ -123,11 +123,11 @@ In the above example comments,
are all typical indicative phrases.
Together with PR references,
these indicative phrases can be used to compose the identification rules.
An identification rule can be implemented as a regular expression
that is matched against comment text to identify duplicate relations.
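Concretely, a rule of this kind can be sketched in Python as below; the indicative phrases and the helper function are illustrative simplifications, not the actual contents of the released rules.py:

```python
import re

# Illustrative simplified rules (hypothetical; the released rules.py
# contains the real set): an indicative phrase followed by a PR
# reference such as "#123".
RULES = [
    re.compile(r"duplicate(?:d|s)?\s+(?:of|with)\s+#(\d+)", re.IGNORECASE),
    re.compile(r"same\s+as\s+#(\d+)", re.IGNORECASE),
    re.compile(r"already\s+(?:fixed|submitted)\s+in\s+#(\d+)", re.IGNORECASE),
]

def find_duplicate_reference(comment):
    """Return the referenced PR number if a rule matches, else None."""
    for rule in RULES:
        match = rule.search(comment)
        if match:
            return int(match.group(1))
    return None

print(find_duplicate_reference("Closing this as a duplicate of #1234."))  # 1234
```

A comment that matches no rule simply yields `None`, so false positives come only from phrases the rules explicitly anticipate.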
The following are some simplified rules,
and the complete set of rules can be found
online.\footnote{\url{https://github.com/whystar/MSR2018-DupPR/blob/master/code/rules.py}}
\begin{itemize}


@ -2,79 +2,80 @@
\section{Applications}
To foster more interest in studying pull-based development based on this dataset
(possibly together with GHTorrent~\cite{Gousios2012GHTorrent} and the GitHub API),
we present some of our preliminary investigations.
%(maybe sometimes together with GHTorrent\cite{Gousios2012GHTorrent} and GitHub API),
%we present some of the preliminary applications.
%To illustrate the potential research that can be conducted based on the dataset \textit{DupPR}
\subsection{Detection latency \& redundant effort}
First,
we have explored the \textit{detection latency} of duplicates.
In this paper,
detection latency is used to measure how long it takes to detect the duplicate relation between two PRs.
It is defined as the time period
from the submission time of a new PR
to the time when the duplicate relation between it and a historical PR is identified.
For each item in table \texttt{Duplicate},
the property \texttt{created\_at} of \texttt{dup\_pr} in table \texttt{Pull-request} is used as the submission time,
and the property \texttt{created\_at} of \texttt{idn\_cmt} in table \texttt{Comment} is used as the identification time.
Figure~\ref{fig:delay_time_bar} shows the statistical distribution of the detection latency based on our dataset.
%We calculate the detection latency of all the duplicates in our dataset,
%and the statistic result is shown in Figure~\ref{fig:delay_time_bar}.
Nearly 21\% (486) of the duplicates are detected after a relatively long latency (more than one week).
Those PRs have probably already consumed substantial unnecessary manpower
and computational resources (\eg continuous integration~\cite{yu2015wait,ci2015}).
% 1,474 (63\%) duplicates are identified less than one day,
% while 865 (37.0\%) duplicates are detected after longer latency which is more than one day.
% \hl{This .....}
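The latency computation defined above can be sketched as follows; the rows are toy stand-ins for the Duplicate, Pull-request, and Comment tables, with hypothetical identifiers:

```python
from datetime import datetime

# Toy stand-ins for the dataset tables (identifiers are hypothetical).
pull_requests = {"dup-1": {"created_at": datetime(2017, 3, 1, 10, 0)}}
comments = {"cmt-9": {"created_at": datetime(2017, 3, 9, 12, 0)}}
duplicates = [{"dup_pr": "dup-1", "idn_cmt": "cmt-9"}]

# Detection latency: from the duplicate PR's submission time to the
# creation time of the comment that identifies the duplicate relation.
for dup in duplicates:
    submitted = pull_requests[dup["dup_pr"]]["created_at"]
    identified = comments[dup["idn_cmt"]]["created_at"]
    latency = identified - submitted
    print(latency.days)  # 8 -> falls into the "more than one week" bucket
```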
In addition,
we focus on how much redundant review effort has been spent %during the detection latency
by counting the number of distinct reviewers and comments
that are involved in the evolution of duplicate PRs.
According to our statistics,
on average 2.5 reviewers participate in the redundant review discussions
and 5.2 review comments are generated before the duplicate relation is identified.
\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{figs/delay_time_bar.png}
\caption{Distribution of detection latency}
\label{fig:delay_time_bar}
%\vspace{-0.2cm}
\end{figure}
% commented out 2018-01-25
% \subsection{Redundant effort}
% TODO: add a statistical analysis comparing the rejected and the selected PRs (\eg number of changed files, lines of code, submitter status)
\subsection{Preference of choice}
For each pair of duplicate PRs,
reviewers have to make a choice between them
or, in rare cases, make a combination of the two.
We have tried to figure out why integrators prefer one pull-request over the other.
In brief, a winning PR exhibits some of the following prominent indicators:
%and recognized some prominent indicators of preferred pull-requests:
(a)~correct implementation;
(b)~early submission (\ie first-come first-merge);
(c)~better implementation (\eg fewer changed lines or better performance);
%%%% TODO: add a significance test here
(d)~providing tests for the changed code;
and (e)~being submitted by new contributors (reviewers prefer new contributors to encourage them to keep contributing).
Obviously,
there are other factors that can affect reviewers' preference of choice,
and we will conduct further research on this topic and analyze the influences of these factors.
We would also investigate the effectiveness of the current practices of choice.
For example, we are studying
whether the preference for new contributors would increase the probability of introducing potential bugs,
and whether it is necessary to dynamically adjust the strategy of choice
according to the development status of a project.
% discuss tests and source changes separately.
% the role of the user.
@ -82,25 +83,30 @@ according to the development status of a project.
% third point: training and evaluation (ground truth)
\subsection{Training \& evaluating models}
The dataset \textit{DupPR} is constructed through a rigorous process
which involves careful manual verification.
Thus, it can act as a ground truth to train and evaluate intelligent models (\eg a classification model).
%Actually,
Here, we conduct a preliminary experiment to automatically identify duplicate PRs.
%we have conducted experiments to automatically identify duplicate PRs at submission time.
By employing natural language processing and calculating the overlap of changes,
we measure the similarity between two PRs, and then
return a candidate list of the top-\textit{k} historical PRs that are most similar to the submitted PR.
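This retrieval step can be sketched as follows, using plain token overlap as a stand-in for the combined text-and-change similarity; the PR numbers and titles are hypothetical:

```python
# Token-overlap (Jaccard) similarity as a stand-in for the combined
# text/change similarity used in the experiment.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def top_k_candidates(new_pr_text, history, k=20):
    """Return the k historical PRs most similar to the new PR."""
    ranked = sorted(history,
                    key=lambda pr: jaccard(new_pr_text, history[pr]),
                    reverse=True)
    return ranked[:k]

# history maps hypothetical PR numbers to their title text.
history = {
    101: "fix memory leak in parser",
    102: "add unit tests for scheduler",
    103: "fix leak in the json parser",
}
print(top_k_candidates("fix parser memory leak", history, k=2))  # [101, 103]
```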
We use half of \textit{DupPR} to train an automatic detection model
and use the rest to evaluate its performance.
Figure~\ref{fig:detect} shows the identification results measured by \textit{recall-rate@k},
which reaches nearly 70\% when the size of the candidate list is set to 20.
%For a pair of duplicates \texttt{<mst\_pr, dup\_pr>},
%the model is effective if \texttt{mst\_pr} appears in the list of candidate duplicates of \texttt{dup\_pr}.
%To elaborate the performance of the model,
%Figure~\ref{fig:detect} shows the \textit{recall-rate@k} of the identification results. %~\cite{Runeson2007}
%When the size of candidate list is set to be 20,
%the automatic detection model is effective for about 70\% duplicate PRs in the test dataset.
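The recall-rate@k metric can be sketched as follows: a pair <mst_pr, dup_pr> counts as a hit if mst_pr appears in the top-k candidate list retrieved for dup_pr (the PR numbers below are hypothetical):

```python
# recall-rate@k over duplicate pairs: the fraction of pairs whose master
# PR appears among the top-k candidates retrieved for the duplicate PR.
def recall_rate_at_k(pairs, candidates, k):
    hits = sum(1 for mst, dup in pairs if mst in candidates[dup][:k])
    return hits / len(pairs)

pairs = [(101, 201), (102, 202), (103, 203)]          # (mst_pr, dup_pr)
candidates = {201: [101, 105], 202: [107, 102], 203: [108, 109]}
print(recall_rate_at_k(pairs, candidates, k=2))  # 2 of 3 pairs hit -> 0.666...
```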
\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{figs/detect.png}
\caption{Performance of the automatic detection model}
\label{fig:detect}
\vspace{-0.2cm}
\end{figure}


@ -1,21 +1,21 @@
%!TEX root = main.tex
\section{Conclusion}
The distributed and parallel characteristics of the pull-based development model
on one hand enable community users to collaborate in a more efficient and effective way,
but on the other hand expose contributors to the risk of submitting duplicate PRs.
In this paper,
we present a large dataset containing 2,323 pairs of duplicate PRs,
collected from 26 popular open source projects hosted in GitHub.
The dataset includes duplicate relations between PRs,
the meta-data of PRs and reviews (\eg creation time, text content and author),
and the basic information of the studied projects.
The dataset allows us to conduct empirical studies to
understand the outcomes and issues of duplicates,
explore the underlying causes and the corresponding prevention strategies,
and analyze the practices and challenges of integrators and contributors in dealing with duplicates.
Moreover,
this dataset enables us to train and evaluate automatic models that can
detect duplicate historical PRs for a newly submitted PR.
@ -24,15 +24,15 @@ However,
this dataset still has several limitations.
The studied projects are only a relatively small proportion of all the projects hosted in GitHub.
We plan to enrich the dataset by taking more projects into consideration.
In addition,
identification rules are extracted based on sampled comments,
and therefore the set of rules might be incomplete,
which would result in false negatives in the dataset.
In future work,
we would like to continually improve the identification method.
In the meantime,
by sharing both the dataset and guidelines for recreation,
we intend to encourage other researchers to validate and extend the dataset.
% By sharing both the dataset and guidelines for recreation,
% we intend to encourage other researchers to validate and extend the dataset,


@ -121,3 +121,11 @@
pages={367--371},
year={2015}
}
@inproceedings{ci2015,
title={Quality and productivity outcomes relating to continuous integration in GitHub},
author={Vasilescu, Bogdan and Yu, Yue and Wang, Huaimin and Devanbu, Premkumar and Filkov, Vladimir},
booktitle={Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, FSE},
pages={805--816},
year={2015}
}