This commit is contained in:
Fisher Yu 2018-03-11 23:39:36 +08:00
parent dd3f49fe06
commit b605b290f8
6 changed files with 91 additions and 77 deletions


@ -2,7 +2,7 @@
\begin{abstract}
In GitHub,
the pull-based development model enables community contributors to collaborate in a more efficient way.
However, the distributed and parallel characteristics of this model
% carry contributors a potential risk of submitting duplicate pull-requests.
pose a potential risk of developers submitting duplicate pull-requests (PRs),


@ -66,8 +66,8 @@ which would guarantee the quality of this dataset.
%The dataset and the source code used to recreate it is available
%online.~\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
%Based on this dataset, the following interesting research can be feasible.
We make the dataset and the source code available online,\footnote{\url{https://github.com/whystar/MSR2018-DupPR}}
in the hope that it will foster more interest in the following studies.
%which enables the researchers
\begin{itemize}


@ -50,7 +50,7 @@ More details can be found in the released dataset.
% when they come across duplicate PRs.
Unlike Stack Overflow,
which indicates duplicate posts with a ``[duplicate]'' signal at the end of question titles,
GitHub provides no explicit and unified mechanism to indicate duplicate PRs.
Although reviewers are encouraged to use the pre-defined
reply template~\footnote{\url{https://help.github.com/articles/about-duplicate-issues-and-pull-requests}}
@ -123,11 +123,11 @@ In the above example comments,
are all typical indicative phrases.
Together with PR references,
these indicative phrases can be used to compose the identification rules.
An identification rule can be implemented as a regular expression
that is matched against comment text to identify duplicate relations.
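Concretely, a rule of this kind can be sketched in Python as below; the indicative phrases and the helper function are illustrative simplifications, not the actual contents of the released rules.py:

```python
import re

# Illustrative simplified rules (hypothetical; the released rules.py
# contains the real set): an indicative phrase followed by a PR
# reference such as "#123".
RULES = [
    re.compile(r"duplicate(?:d|s)?\s+(?:of|with)\s+#(\d+)", re.IGNORECASE),
    re.compile(r"same\s+as\s+#(\d+)", re.IGNORECASE),
    re.compile(r"already\s+(?:fixed|submitted)\s+in\s+#(\d+)", re.IGNORECASE),
]

def find_duplicate_reference(comment):
    """Return the referenced PR number if a rule matches, else None."""
    for rule in RULES:
        match = rule.search(comment)
        if match:
            return int(match.group(1))
    return None

print(find_duplicate_reference("Closing this as a duplicate of #1234."))  # 1234
```

A comment that matches no rule simply yields `None`, so false positives come only from phrases the rules explicitly anticipate.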
The following are some simplified rules,
and the complete set of rules can be found
online.\footnote{\url{https://github.com/whystar/MSR2018-DupPR/blob/master/code/rules.py}}
\begin{itemize}


@ -2,79 +2,80 @@
\section{Applications}
To foster more interest in studying pull-based development based on this dataset
(possibly together with GHTorrent~\cite{Gousios2012GHTorrent} and the GitHub API),
we present some of our preliminary investigations.
%(maybe sometimes together with GHTorrent\cite{Gousios2012GHTorrent} and GitHub API),
%we present some of the preliminary applications.
%To illustrate the potential research that can be conducted based on the dataset \textit{DupPR}
\subsection{Detection latency \& redundant effort}
First,
we have explored the \textit{detection latency} of duplicates.
In this paper,
detection latency is used to measure how long it takes to detect the duplicate relation between two PRs.
It is defined as the time period
from the submission time of a new PR
to the time when the duplicate relation between it and a historical PR is identified.
For each item in table \texttt{Duplicate},
the property \texttt{created\_at} of \texttt{dup\_pr} in table \texttt{Pull-request} is used as the submission time,
and the property \texttt{created\_at} of \texttt{idn\_cmt} in table \texttt{Comment} is used as the identification time.
Figure~\ref{fig:delay_time_bar} shows the statistical distribution of the detection latency based on our dataset.
%We calculate the detection latency of all the duplicates in our dataset,
%and the statistic result is shown in Figure~\ref{fig:delay_time_bar}.
Nearly 21\% (486) of the duplicates are detected after a relatively long latency (more than one week).
Those PRs have probably already consumed substantial unnecessary manpower
and computational resources (\eg continuous integration~\cite{yu2015wait,ci2015}).
% 1,474 (63\%) duplicates are identified less than one day,
% while 865 (37.0\%) duplicates are detected after longer latency which is more than one day.
% \hl{This .....}
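The latency computation defined above can be sketched as follows; the rows are toy stand-ins for the Duplicate, Pull-request, and Comment tables, with hypothetical identifiers:

```python
from datetime import datetime

# Toy stand-ins for the dataset tables (identifiers are hypothetical).
pull_requests = {"dup-1": {"created_at": datetime(2017, 3, 1, 10, 0)}}
comments = {"cmt-9": {"created_at": datetime(2017, 3, 9, 12, 0)}}
duplicates = [{"dup_pr": "dup-1", "idn_cmt": "cmt-9"}]

# Detection latency: from the duplicate PR's submission time to the
# creation time of the comment that identifies the duplicate relation.
for dup in duplicates:
    submitted = pull_requests[dup["dup_pr"]]["created_at"]
    identified = comments[dup["idn_cmt"]]["created_at"]
    latency = identified - submitted
    print(latency.days)  # 8 -> falls into the "more than one week" bucket
```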
In addition,
we focus on how much redundant review effort has been spent %during the detection latency
by counting the number of distinct reviewers and comments
that are involved in the evolution of duplicate PRs.
According to our statistics,
on average 2.5 reviewers participate in the redundant review discussions
and 5.2 review comments are generated before the duplicate relation is identified.
\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{figs/delay_time_bar.png}
\caption{Distribution of detection latency}
\label{fig:delay_time_bar}
%\vspace{-0.2cm}
\end{figure}
% commented out 2018-01-25
% \subsection{Redundant effort}
% TODO: add a statistical analysis comparing the rejected and the selected PRs (\eg number of changed files, lines of code, submitter status)
\subsection{Preference of choice}
For each pair of duplicate PRs,
reviewers have to make a choice between them
or, in rare cases, make a combination of the two.
We have tried to figure out why integrators prefer one pull-request over the other.
In brief, a winning PR exhibits some of the following prominent indicators:
%and recognized some prominent indicators of preferred pull-requests:
(a)~correct implementation;
(b)~early submission (\ie first-come first-merge);
(c)~better implementation (\eg fewer changed lines or better performance);
%%%% TODO: add a significance test here
(d)~providing tests for the changed code;
and (e)~being submitted by new contributors (reviewers prefer new contributors to encourage them to keep contributing).
Obviously,
there are other factors that can affect reviewers' preference of choice,
and we will conduct further research on this topic and analyze the influences of these factors.
We would also investigate the effectiveness of the current practices of choice.
For example, we are studying
whether the preference for new contributors would increase the probability of introducing potential bugs,
and whether it is necessary to dynamically adjust the strategy of choice
according to the development status of a project.
% discuss tests and source changes separately.
% the role of the user.
@ -82,25 +83,30 @@ according to the development status of a project.
% third point: training and evaluation (ground truth)
\subsection{Training \& evaluating models}
The dataset \textit{DupPR} is constructed through a rigorous process
which involves careful manual verification.
Thus, it can act as a ground truth to train and evaluate intelligent models (\eg a classification model).
%Actually,
Here, we conduct a preliminary experiment to automatically identify duplicate PRs.
%we have conducted experiments to automatically identify duplicate PRs at submission time.
By employing natural language processing and calculating the overlap of changes,
we measure the similarity between two PRs, and then
return a candidate list of the top-\textit{k} historical PRs that are most similar to the submitted PR.
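This retrieval step can be sketched as follows, using plain token overlap as a stand-in for the combined text-and-change similarity; the PR numbers and titles are hypothetical:

```python
# Token-overlap (Jaccard) similarity as a stand-in for the combined
# text/change similarity used in the experiment.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def top_k_candidates(new_pr_text, history, k=20):
    """Return the k historical PRs most similar to the new PR."""
    ranked = sorted(history,
                    key=lambda pr: jaccard(new_pr_text, history[pr]),
                    reverse=True)
    return ranked[:k]

# history maps hypothetical PR numbers to their title text.
history = {
    101: "fix memory leak in parser",
    102: "add unit tests for scheduler",
    103: "fix leak in the json parser",
}
print(top_k_candidates("fix parser memory leak", history, k=2))  # [101, 103]
```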
We use half of \textit{DupPR} to train an automatic detection model
and use the rest to evaluate its performance.
Figure~\ref{fig:detect} shows the identification results measured by \textit{recall-rate@k},
which reaches nearly 70\% when the size of the candidate list is set to 20.
%For a pair of duplicates \texttt{<mst\_pr, dup\_pr>},
%the model is effective if \texttt{mst\_pr} appears in the list of candidate duplicates of \texttt{dup\_pr}.
%To elaborate the performance of the model,
%Figure~\ref{fig:detect} shows the \textit{recall-rate@k} of the identification results. %~\cite{Runeson2007}
%When the size of candidate list is set to be 20,
%the automatic detection model is effective for about 70\% duplicate PRs in the test dataset.
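The recall-rate@k metric can be sketched as follows: a pair <mst_pr, dup_pr> counts as a hit if mst_pr appears in the top-k candidate list retrieved for dup_pr (the PR numbers below are hypothetical):

```python
# recall-rate@k over duplicate pairs: the fraction of pairs whose master
# PR appears among the top-k candidates retrieved for the duplicate PR.
def recall_rate_at_k(pairs, candidates, k):
    hits = sum(1 for mst, dup in pairs if mst in candidates[dup][:k])
    return hits / len(pairs)

pairs = [(101, 201), (102, 202), (103, 203)]          # (mst_pr, dup_pr)
candidates = {201: [101, 105], 202: [107, 102], 203: [108, 109]}
print(recall_rate_at_k(pairs, candidates, k=2))  # 2 of 3 pairs hit -> 0.666...
```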
\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{figs/detect.png}
\caption{Performance of the automatic detection model}
\label{fig:detect}
\vspace{-0.2cm}
\end{figure}


@ -1,21 +1,21 @@
%!TEX root = main.tex
\section{Conclusion}
The distributed and parallel characteristics of the pull-based development model
on one hand enable community users to collaborate in a more efficient and effective way,
but on the other hand expose contributors to the risk of submitting duplicate PRs.
In this paper,
we present a large dataset containing 2,323 pairs of duplicate PRs,
collected from 26 popular open source projects hosted in GitHub.
The dataset includes duplicate relations between PRs,
the meta-data of PRs and reviews (\eg creation time, text content and author),
and the basic information of the studied projects.
The dataset allows us to conduct empirical studies to
understand the outcomes and issues of duplicates,
explore the underlying causes and the corresponding prevention strategies,
and analyze the practices and challenges of integrators and contributors in dealing with duplicates.
Moreover,
this dataset enables us to train and evaluate automatic models that can
detect duplicate historical PRs for a newly submitted PR.
@ -24,15 +24,15 @@ However,
this dataset still has several limitations.
The studied projects are only a relatively small proportion of all the projects hosted in GitHub.
We plan to enrich the dataset by taking more projects into consideration.
In addition,
identification rules are extracted based on sampled comments,
and therefore the set of rules might be incomplete,
which would result in false negatives in the dataset.
In future work,
we would like to continually improve the identification method.
In the meantime,
by sharing both the dataset and guidelines for recreation,
we intend to encourage other researchers to validate and extend the dataset.
% By sharing both the dataset and guidelines for recreation,
% we intend to encourage other researchers to validate and extend the dataset,


@ -121,3 +121,11 @@
pages={367--371},
year={2015}
}
@inproceedings{ci2015,
title={Quality and productivity outcomes relating to continuous integration in GitHub},
author={Vasilescu, Bogdan and Yu, Yue and Wang, Huaimin and Devanbu, Premkumar and Filkov, Vladimir},
booktitle={Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, FSE},
pages={805--816},
year={2015}
}