
\section{PATTERN MINING}
\label{sec:Pattern}

% Use 3C to classify the large-scale dataset and then analyze it
To take advantage of the entire set of review comments when \hl{mining underlying patterns}, we automatically classified the remaining unlabeled comments with the help of \CCC. Based on the larger dataset of labeled review comments, we further explored the \hl{distribution of review comments over the categories and the co-occurrence of categories at the pull-request level}.

\subsection{Distribution of Review Comments}

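% 1. Commonalities
	% 1.1. The three projects show roughly the same distribution trend
		% A "double hump" trend
	% 1.2. Focus on which categories are frequent or rare
		% Defect detection is frequent
		% Code style and testing comments are rare, partly because OSS projects have established development conventions, and partly because contributors in a transparent environment care about their reputation and try to avoid low-level mistakes.
		% Social interaction is frequent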

We wanted to understand which types of comments reviewers tend to post when evaluating a pull-request. We \hl{counted} the review comments in each of the 2-level categories, and Figure \ref{fig:stas_plot} shows the result as the percentage of each category among all review comments.
Overall, the distribution patterns of the three projects are similar and reveal a \hl{``double hump''} trend from L2-1 to L2-11.
Notably, \textit{defect detection (L2-2)} ranks first (Rails: 42.5\%, ElasticSearch: 50.6\%, Angular.js: 33.1\%). This is consistent with previous studies~\cite{Gousios:2014b,Bacchelli:2013}, which state that the main purpose of code review is to find defects.
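
As a minimal formalization of how these percentages can be read (the notation is ours and is an assumption introduced only for illustration), let $C_p$ be the set of review comments of project $p$ and $L(c)$ the set of 2-level categories assigned to a comment $c$. The share of category $k$ in project $p$ is then
\[
  \mathrm{share}_p(k) = \frac{|\{\, c \in C_p : k \in L(c) \,\}|}{|C_p|} \times 100\%.
\]
If a comment may carry more than one category, these shares need not sum to 100\% over $k$.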

Interestingly, although commenting on code style and testing takes relatively little effort and does not require a deep understanding of the code change, comments in these two categories are less frequent than those in the others. This is somewhat inconsistent with the study~\cite{Bacchelli:2013} conducted by Bacchelli \etal on Microsoft teams, in which they found that \textit{code style} (called \textit{code improvement} in their study) is the most frequent outcome of code reviews. After comparing the two settings, we believe the difference may stem from the development environment. On GitHub, contributors work in a transparent environment and their activity is visible to everyone~\cite{Tsay:2014b,Gousios:2016}. Hence, to build and maintain a good reputation, contributors usually go through their code carefully to avoid making ``stupid mistakes''. Moreover, OSS projects provide plenty of guidelines that help newcomers avoid low-level errors related to code style. Social interaction (L2-10\&L2-11) also accounts for a high proportion of comments, 25.4\%, 18.0\%, and 29.1\% in Rails, ElasticSearch, and Angular.js respectively, which reflects the reviewers' concern about interaction during the pull-request assessment process.


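% 2. Differences
	% 2.1. E.g., Rails has complex modules and many contributors, so reviewer assignment (7) occurs often
	% 2.2. aj: 6 and 9 are frequent
	% 2.3. es: 4 is frequent, 10 is rare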

However, the distribution of some categories differs across the three projects. Rails is a complicated framework that consists of \hl{plenty of modules and source files}, and compared with the other two projects, it attracts more contributors and receives more pull-requests. It is common for pull-request reviewers to assign or recommend other reviewers who are more familiar with the changed files. As a result, \textit{reviewer assignment} is a more frequent activity in Rails than in the other two projects.

[some more\dots]

% Distribution of the various comment types
\begin{figure}[ht]
 	\centering
 	\includegraphics[width=9cm]{resources/stas_plot.png}
 	\caption{Review comment distribution on 2-level categories}% TODO (author note): Test
	\label{fig:stas_plot}
\end{figure}


\subsection{Category Co-occurrence}
% Study the distribution of the various comment types at the pull-request level
We conducted an experiment on all the pull-requests and review comments of the three projects to explore the co-occurrence of categories within a pull-request, and Figure \ref{fig:co_oc_node_e} illustrates the result. The size of a node indicates the number of pull-requests that receive the corresponding category of comment, and the width of a line between two nodes represents the number of pull-requests in which the two categories co-occur.
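
A sketch of the quantities behind the graph, under the assumption that a category is counted at most once per pull-request (the notation is ours and is introduced only for illustration): let $R$ be the set of pull-requests and $K(r)$ the set of categories that occur among the review comments of pull-request $r$. Then
\[
  n(k) = |\{\, r \in R : k \in K(r) \,\}|, \qquad
  w(k_i, k_j) = |\{\, r \in R : k_i \in K(r) \wedge k_j \in K(r) \,\}|, \quad k_i \neq k_j,
\]
where $n(k)$ corresponds to the size of node $k$ and $w(k_i, k_j)$ to the width of the line between $k_i$ and $k_j$. For example, a pull-request whose comments cover L2-2, L2-4, and L2-10 adds one to each of the three node counts and one to each of the three category pairs.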

% Add a bit more general introduction

% Several specific phenomena
Looking at the figure, the most interesting observation is that the most frequent category pair is \textit{defect detection (L2-2)} and \textit{PR acceptance (L2-4)}.
Intuitively, the detection of a defect would tend to result in rejection, which means \textit{defect detection (L2-2)} should co-occur with \textit{PR rejection (L2-5)} more frequently.
However, as mentioned earlier, development in the pull-based model is a \hl{review-driven} process that consists of iterative committing and reviewing.
Detecting defects in a pull-request does not always lead to rejection. The contributor can update the pull-request with new commits according to the reviewers' suggestions, improving the quality of the pull-request along the way.
After multiple rounds of code review, a reviewer can approve the pull-request if they are satisfied with the current solution.

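% We could note that, for contributors, if a reviewer points out code defects they should not give up; as long as they revise carefully, they can still win the reviewer's approval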

% gephi: with 10 as the maximum, the ratios of the others are [0.148, 0.951, 0.233, 0.568, 0.509, 0.299, 0.352, 0.377, 0.367, 1.0, 0.357]
% 0.117, 0.935, 0.237, 0.881, 0.397, 0.398, 0.308, 0.271, 0.478, 1.0, 0.374
\begin{figure}[ht]
 	\centering
 	\includegraphics[width=6cm]{resources/all_co_oc_node_e.png}
 	\caption{Pull-request distribution on comment categories}% TODO (author note): add a list for the count of PRs in each category
	\label{fig:co_oc_node_e}
\end{figure}


% \subsection{Contributor Role }
% % Is there any difference in the types of comments posted and received by core versus non-core users
% We further explored whether the contributor's role (core member or outside contributor) affects the distribution of comments from two perspectives: the comments a contributor posted and the comments a contributor received.
% Figure \ref{fig:role_dis} shows the comment distribution difference between core members and outside contributors. From this figure we can conclude that there is no difference

% From Figure \ref{fig:role_dis}, we can see that there is no difference of comment distribution over contributor's role and all the distribution patterns present a \hl{``double hump''} trend which is consistent with the previous global one.
% \begin{figure}[ht]
%  	\centering
%  	\includegraphics[width=9cm]{resources/role_dis.png}
%  	\caption{Comment distribution over contributor's role(!!!!add a list for count of pr in each catogery!!!!)}
% 	\label{fig:role_dis}
% \end{figure}

% \begin{table*}[ht]
% 	\centering
% 	\caption{Dataset of our experiments}
% 	\begin{tabular}{r c c c c c c c}
% 		\toprule
% 		\textbf{Statistic} &\textbf{Mean} &\textbf{St.} &\textbf{Dev.} &\textbf{Min} &\textbf{Median} &\textbf{Max}	&\textbf{Histogram}\\
% 		\midrule
% 		\textbf{Project level metrics}\hspace{1em} \\
% 			Project\_age &	& & & & & &  		\\
% 			Team\_size		&	& & & & & & 	\\
% 			Pr\_growth	&	& & & & & 	& \\
% 		\midrule
% 		\textbf{Pull-request level metrics}\hspace{1em} \\
% 			Churn				&	& & & & & & 		\\
% 			N\_files			&	& & & & & & 		\\
% 			Src\_touched		&	& & & & & & 		\\
% 			config\_ touched		&	& & & & & & 		\\
% 			test\_touched	&	& & &&  & & 	\\
% 		\midrule
% 		\textbf{Contributor level metrics}\hspace{1em} \\
% 			core team			&	& & & & & & 		\\
% 			Prev\_prs\_local		&	& & & & & & 		\\
% 			Prev\_prs\_global	&	& & &&  & & 	\\
% 		\midrule
% 		\textbf{Comment level metrics}\hspace{1em} \\
% 			Coment\_length	&	& & &&  & & 			\\
% 			Coment\_type	&	& & &&  & & 			\\
% 			sim\_pr\_title	&	& & &&  & & 			\\
% 			Sim\_pr\_desc	&	& & &&  & & 			\\
% 			Code\_inclusion	&	& & &&  & & 			\\
% 			Ref\_inclusion	&	& & &&  & & 			\\
% 			Link\_inclusion	&	& & &&  & & 			\\
% 			Ping\_inclusion	&	& & &&  & & 			\\

% 		\bottomrule
% .	\end{tabular}
% 	\label{tab:factors}
% \end{table*}


% \begin{table*}[ht]
% 	\centering
% 	\caption{Dataset of our experiments}
% 	\begin{tabular}{r c c c c c c}
% 		\toprule
% 		&\multicolumn{2}{c}{Var. 1}    & \multicolumn{2}{c}{Var. 2}    &\multicolumn{2}{c}{Var. 3}\\
% 		& Coeffs(Errors) &Sum Sq.   &Coeffs(Errors) &Sum Sq.     &Coeffs(Errors) &Sum Sq.\\
% 		\midrule
% 			Project\_age &	& & & & &   		\\
% 			Team\_size		&	& & & & &  	\\
% 			Pr\_growth	&	& & & & & 	 \\
% 		\midrule
% 			Churn				&	& & & & & 		\\
% 			N\_files			&	& & & & &  		\\
% 			Src\_touched		&	& & & & &  		\\
% 			config\_ touched		&	& & & & &  		\\
% 			test\_touched	&	& & &&  &  	\\
% 		\midrule
% 			core team			&	& & & & &  		\\
% 			Prev\_prs\_local		&	& & & & &  		\\
% 			Prev\_prs\_global	&	& & &&  &  	\\
% 		\bottomrule
% 		\multicolumn{7}{l}{\emph{*** p \textless 0.001, ** p \textless 0.01, *p \textless 0.05}}
% .	\end{tabular}
% 	\label{tab:lg}
% \end{table*}

% % Effect of the various factors on negative comments
% \noindent\underline{\textbf{Project level}}

% \noindent\textbf{Project\_age:} the time from project creation on GitHub to the pull-request creation in months, which is used as a proxy for maturity.
% % Creation time

% \noindent\textbf{Team\_size:} the number of active integrators, who decide whether to accept the new contributions, during the three months prior to pull-request creation.
% % Table site_prj_comments core_member: whether there was activity in the last * months (weeks)

% \noindent\textbf{Pr\_growth:} Average number of pull requests submitted \hl{per day}, prior to the examined pull request.
% % Table site_prj_prs: number of new PRs in the last * days (weeks)


% \noindent\underline{\textbf{Pull-request level}}
% % Table site_pr_infl

% \noindent\textbf{Churn:} total number of lines added and deleted by the pull-request. Bigger code changes may be more complex, and require longer code evaluation.

% \noindent\textbf{N\_files:} total number of files changed in the pull-request, which is a signal of pull-request size.

% \noindent\textbf{Src\_touched:} binary, measuring if the pull-request touched at least one source code file (documentation is treated as a source code file).

% \noindent\textbf{Config\_touched:} binary, measuring if the pull-request touched at least one configure file.

% \noindent\textbf{Test\_touched:} binary, measuring if the pull-request touched at least one test file.

% \noindent\underline{\textbf{Contributor level}}

% \noindent\textbf{Core\_team:} binary (core\_contributor or outside\_contributor), whether the submitter is a member of the core development team.
% % Table user_role

% \noindent\textbf{Prev\_prs\_local:} Number of pull requests submitted to a specific project by a developer, prior to the examined pull request.
% % Table site_prj_prs: select all PRs by author_name, then sort by time

% \noindent\textbf{Prev\_prs\_global:} Number of pull requests submitted by a developer in Github, prior to the examined pull request.
% % Table user_pr: select all PRs by user, then sort by time