minor refinement

2020-06-10 19:36:05 +08:00 · 640a64eb2a
parent eeb3d63b80
commit 640a64eb2a
5 changed files with 431 additions and 198 deletions
--- a/4_rq2.tex
+++ b/4_rq2.tex
@ -568,6 +568,95 @@ we identify \hl{**} metrics that can be computed  at  pull  request submission t
 The identified metrics are classified into the following three categories:


+
+
+\vspace{0.5em} 
+\noindent\textbf{Project-level characteristics.} 
+
+\vspace{0.2em} 
+\textit{Maturity.}
+Previous studies used the metric \texttt{proj\_age},
+\ie the time elapsed from when the project was first hosted on GitHub to the pull request submission, 
+as an indicator of project maturity~\cite{Tsay2014Influence,yu16det,Rahman2014An}.
+However, 
+a project does not necessarily adopt the pull request model from the outset.
+Therefore, we also use the metric \texttt{prmodel\_age} to indicate 
+how long a project has been using the pull request development model.
+
+
+\vspace{0.2em} 
+\textit{Workload.}
+The discussion of issues and pull requests might take days to months to conclude.
+At any given time, 
+many open issues and pull requests 
+might be under discussion simultaneously.
+Prior studies have characterized project integrators' workload using two metrics:
+\texttt{open\_tasks}~\cite{yu16det} and \texttt{team\_size}~\cite{Tsay2014Influence,Gousios:2014,yu16det},
+which are the number of open issues and open pull requests at pull request submission time and 
+the number of active core team members during the last three months, respectively.
+
+
+\vspace{0.2em} 
+\textit{Popularity.}
+In measuring project popularity,
+the metric \texttt{stars},
+\ie the number of stars the project has received, 
+was commonly used in prior studies~\cite{Bor17Und,Tsay2014Influence}.
+In addition, 
+we consider three other popularity-related metrics:
+\texttt{forks}, \texttt{prs}, and \texttt{contributors},
+which are the number of forks, the number of pull requests, and the number of contributors 
+of the project, respectively.
+
+
+% \vspace{0.2em} 
+% \textit{Activeness.}
+% We further measure the activeness of the project
+% by including four metrics: 
+% \texttt{stars\_3M}, \texttt{forks\_3M}, \texttt{prs\_3M}, and \texttt{contributors\_3M}, 
+% which are the new stars, new forks, new pull requests, and active contributors in the project in the last three months, respectively.
+
+
+
+\vspace{0.5em} 
+\noindent\textbf{Submitter-level characteristics.} 
+
+\vspace{0.2em} 
+\textit{Experience.}
+Developers' experience before they submit the pull request has been analyzed in prior studies~\cite{Gousios:2014,jiang2013will}.
+This measure can be computed from two perspectives: 
+project-level experience and community-level experience.
+The former measures the number of previous pull requests  
+that the developer has submitted to a specific project (\texttt{prev\_prs\_proj}) and their acceptance rate (\texttt{prev\_acc\_proj}).
+The latter measures the number of previous pull requests 
+that the developer has submitted to GitHub (\texttt{prev\_prs}) and their acceptance rate (\texttt{prev\_acc}).
+When calculating the acceptance rate,
+we determine whether a pull request was integrated 
+through mechanisms other than GitHub's merge button by following the heuristics defined in previous studies~\cite{Gousios:2014,zhou2019fork}.
+We also use two metrics, \texttt{first\_pr\_proj} and 
+\texttt{first\_pr}, to represent whether the pull request is the first one submitted by the developer to a specific project and to GitHub, respectively.
+
+\vspace{0.2em} 
+\textit{Standing.}
+A dichotomous metric \texttt{core\_team},
+which indicates whether the pull request submitter is a core team member of the project,
+was commonly used as a signal of the developer's standing within the project~\cite{Tsay2014Influence,yu16det}.
+Furthermore, 
+a continuous metric \texttt{followers}, 
+\ie the number of GitHub users following the pull request submitter, 
+was used to represent the developer's standing within the community~\cite{Tsay2014Influence,Gousios:2014,yu16det}.
+
+
+\vspace{0.2em} 
+\textit{Previous interaction.}
+This metric (\texttt{prev\_interaction}) is the total number of events, 
+\eg commenting on issues and pull requests, 
+prior to the pull request submission 
+that the developer has participated in within the project~\cite{Tsay2014Influence,yu16det}.
+
+
+
+
 \vspace{0.5em} 
 \noindent\textbf{Patch-level characteristics.} 

@ -621,89 +710,6 @@ and \textit{Doc} changing documentation files.
 This metric (\texttt{activity\_type}) is determined by checking the names and extensions of changed files.


-\vspace{0.5em} 
-\noindent\textbf{Submitter-level characteristics.} 
-
-\vspace{0.2em} 
-\textit{Experience.}
-Developers' experience before they submit the pull request has been analyzed in prior studies~\cite{Gousios:2014,jiang2013will}.
-This measure can be computed from two perspectives: 
-project-level experience and community-level experience.
-The former measures the number of previous pull requests  
-that have submitted to a specific project (\texttt{prev\_prs\_proj}) and their acceptance rate (\texttt{prev\_acc\_proj}).
-The latter measures the number of previous pull requests 
-that have been submitted to GitHub (\texttt{prev\_prs})  and their acceptance rate (\texttt{prev\_acc}).
-When calculating acceptance rate,
-the determination of whether the pull request was integrated 
-through other mechanisms than GitHub's merge button follows the heuristics defined in previous studies~\cite{Gousios:2014,zhou2019fork}.
-We also use two metrics \texttt{first\_pr\_proj} and 
-\texttt{first\_pr} to represent whether the pull request is the first one submitted by a developer to a specific project and GitHub, respectively.
-
-\vspace{0.2em} 
-\textit{Standing.}
-A dichotomous metric \texttt{core\_team},
-which indicates whether the pull request submitter is the core team member of the project,
-was commonly used as a signal of the developer's standing within the project~\cite{Tsay2014Influence,yu16det}.
-Furthermore, 
-a continuous metric \texttt{followers}, 
-\ie the number of GitHub users that are following the pull request submitter, 
-was used to represent the developers' standing within the community~\cite{Tsay2014Influence,Gousios:2014,yu16det}.
-
-
-\vspace{0.2em} 
-\textit{Previous interaction.}
-This metric (\texttt{prev\_interaction}) is the total number of events, 
-\eg such as commenting on issues and pull requests, 
-prior to the pull request submission 
-that the developer has participated in within the project~\cite{Tsay2014Influence,yu16det}.
-
-
-
-\vspace{0.5em} 
-\noindent\textbf{Project-level characteristics.} 
-
-\vspace{0.2em} 
-\textit{Maturity.}
-Previous studies used the metric \texttt{proj\_age},
-\ie the period of time from the time the project was hosted on GitHub to the pull request submission time, 
-as an indicator of the project maturity~\cite{Tsay2014Influence,yu16det,Rahman2014An}.
-However, 
-a project does not necessarily use the pull request model in the first place.
-We also use the metric \texttt{prmodel\_age} to indicate 
-how long a project has adopted the pull request development model.
-
-
-\vspace{0.2em} 
-\textit{Workload.}
-The discussion of issues and pull requests might cost days to months to come to an end.
-At any given time, 
-a bunch of open issues and pull requests 
-might be discussed simultaneously.
-Prior studies have characterized project integrators' workload using two metrics:
-\texttt{open\_tasks}~\cite{yu16det} and  \texttt{team\_size}~\cite{Tsay2014Influence,Gousios:2014,yu16det},
-which are the number of open issues and open pull requests at the pull request submission and 
-the number of active core team members during the last three months, respectively.
-
-
-\vspace{0.2em} 
-\textit{Popularity.}
-In measuring project popularity,
-the metric \texttt{stars},
-\ie the number of stars the project has got, 
-was commonly used in prior studies~\cite{Bor17Und,Tsay2014Influence}.
-In addition, 
-we also considered three other popularity-related metrics:
-\texttt{forks}, \texttt{prs}, and \texttt{contributors},
-which are the number of forks, the number of pull requests, and the number of contributors 
-of the project, respectively.
-
-
-\vspace{0.2em} 
-\textit{Activeness.}
-We further measure the activeness of the project
-by including four metrics: 
-\texttt{stars\_3M}, \texttt{forks\_3M}, \texttt{prs\_3M}, and \texttt{contributors\_3M}, 
-which are the new stars, new forks, new pull requests, and active contributors in the project in the last three months, respectively.


 \subsubsection{Comparative exploration}
@ -732,8 +738,8 @@ $H_{0}$: duplicate pull requests exhibit a value of metric $m$ equal to that one
 \texttt{proj\_age}, \texttt{prmodel\_age}, 
 \hl{\texttt{open\_tasks}}
 \texttt{open\_issues}, \texttt{open\_prs}, \texttt{team\_size}, 
-\texttt{forks}, \texttt{stars}, \texttt{prs}, \texttt{contributors}, 
-\texttt{forks\_3M}, \texttt{stars\_3M}, \texttt{prs\_3M}, \texttt{contributors\_3M}\}
+\texttt{forks}, \texttt{stars}, \texttt{prs}, \texttt{contributors}
+\}


 \vspace{0.2em} 
@ -761,6 +767,56 @@ are different in terms of all metrics except for that they have similar number o
 		\toprule	
 		\multicolumn{2}{r}{\textbf{Metric}} &\tabincell{c}{\textbf{Effect}\\\textbf{size}} &\tabincell{c}{\textbf{\textit{Adjusted}}\\\textbf{\textit{p-value}}}\\

+
+		\midrule
+		\multicolumn{4}{@{}l}{\textbf{Project-level characteristics}}\\
+		
+		\cdashline{1-4}[0.8pt/2pt]
+		\multirow{2}{*}{Maturity}
+		& \texttt{proj\_age}&0.166 & 2.95e-21 ***\\
+		& \texttt{prmodel\_age}& -0.108 &2.34e-11 ***\\
+
+		\cdashline{1-4}[0.8pt/2pt]
+		\multirow{2}{*}{Workload}
+		& \texttt{open\_tasks}&0.177 &2.00e-37 ***\\
+		& \texttt{team\_size}& 0.257& 1.98e-41 ***\\
+
+		\cdashline{1-4}[0.8pt/2pt]
+		\multirow{4}{*}{Popularity}
+		& \texttt{forks}& -0.327& 3.06e-61 ***\\ 
+		& \texttt{watchers}& -0.289& 9.77e-47 ***\\
+		& \texttt{prs}& 0.187& 5.34e-19 ***\\
+		& \texttt{contributors}& -0.238& 3.71e-49 ***\\
+
+		% \cdashline{1-4}[0.8pt/2pt]
+		% \multirow{4}{*}{Activeness}
+		% & \texttt{forks\_3M}& -0.315& 6.39e-51 ***\\
+		% & \texttt{watchers\_3M}& -0.178& 2.07e-12 ***\\
+		% & \texttt{prs\_3M}& 0.280& 7.11e-85 ***\\
+		% & \texttt{contributors\_3M}& -0.115 & 2.31e-07 ***\\
+
+		\midrule
+		\multicolumn{4}{@{}l}{\textbf{Submitter-level characteristics}}\\
+		
+		\cdashline{1-4}[0.8pt/2pt]
+		\multirow{4}{*}{Experience}
+		& \texttt{prev\_prs\_proj}&0.281 & 5.46e-192 ***\\
+		& \texttt{prev\_prs}& 0.227& 1.82e-94 ***\\
+		& \texttt{first\_pr\_proj}& -0.435 & 3.21e-149 ***\\
+		& \texttt{first\_pr}& -0.200 & 3.91e-33 ***\\
+		
+		\cdashline{1-4}[0.8pt/2pt]
+		\multirow{2}{*}{Standing}
+		& \texttt{core\_team}& 0.385&2.40e-117 ***\\
+		& \texttt{followers}&-0.002 & 8.78e-21 ***\\
+		
+		\cdashline{1-4}[0.8pt/2pt]
+		\multirow{1}{*}{Interaction}
+		% & \texttt{followings}& -0.012 &3.21e-06 ***\\
+		% & \texttt{watch\_proj}&0.124 &1.28e-13 ***\\
+		& \texttt{prev\_interaction}&0.163 & 9.81e-108 ***\\
+
+
 		\midrule
 		\multicolumn{4}{@{}l}{\textbf{Patch-level characteristics}}\\
 		\cdashline{1-4}[0.8pt/2pt]
@ -781,54 +837,7 @@ are different in terms of all metrics except for that they have similar number o
 		&\texttt{change\_type}& 0.055& 9.73e-4 ***\\
 		&\texttt{activity\_type}& 0.055& 9.73e-4 ***\\

-		\midrule
-		\multicolumn{4}{@{}l}{\textbf{Submitter-level characteristics}}\\
-		
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{4}{*}{Experience}
-		& \texttt{prev\_prs\_porj}&0.281 & 5.46e-192 ***\\
-		& \texttt{prev\_prs}& 0.227& 1.82e-94 ***\\
-		& \texttt{first\_pr\_proj}& -0.435 & 3.21e-149 ***\\
-		& \texttt{first\_pr}& -0.200 & 3.91e-33 ***\\
-		
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{2}{*}{Standing}
-		& \texttt{core\_team}& 0.385&2.40e-117 ***\\
-		& \texttt{followers}&-0.002 & 8.78e-21 ***\\
-		
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{3}{*}{Connection}
-		& \texttt{followings}& -0.012 &3.21e-06 ***\\
-		& \texttt{watch\_proj}&0.124 &1.28e-13 ***\\
-		& \texttt{prev\_interaction}&0.163 & 9.81e-108 ***\\

-		\midrule
-		\multicolumn{4}{@{}l}{\textbf{Project-level characteristics}}\\
-		
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{2}{*}{Maturity}
-		& \texttt{proj\_age}&0.166 & 2.95e-21 ***\\
-		& \texttt{prmodel\_age}& -0.108 &2.34e-11 ***\\
-
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{3}{*}{Wordload}
-		& \texttt{open\_issues}&0.177 &2.00e-37 ***\\
-		& \texttt{open\_prs}& 0.011& 4.71e-32 ***\\
-		& \texttt{team\_size}& 0.257& 1.98e-41 ***\\
-
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{4}{*}{Popularity}
-		& \texttt{forks}& -0.327& 3.06e-61 ***\\ 
-		& \texttt{watchers}& -0.289& 9.77e-47 ***\\
-		& \texttt{prs}& 0.187& 5.34e-19 ***\\
-		& \texttt{contributors}& -0.238& 3.71e-49 ***\\
-
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{4}{*}{Activeness}
-		& \texttt{forks\_3M}& -0.315& 6.39e-51 ***\\
-		& \texttt{watchers\_3M}& -0.178& 2.07e-12 ***\\
-		& \texttt{prs\_3M}& 0.280& 7.11e-85 ***\\
-		& \texttt{contributors\_3M}& -0.115 & 2.31e-07 ***\\
 	
 		\bottomrule
 	\end{tabularx}
@ -924,85 +933,53 @@ For project-level metrics,
 	\renewcommand{\arraystretch}{1.15} 
 	\centering
 	\caption{\color{red}{Statistical models for the likelihood of duplicate pull requests}}
-	\begin{tabularx}{\textwidth}{@{}l r Y Y Y Y Y Y Y Y Y@{}}
+	\begin{tabularx}{\textwidth}{@{}r Y Y Y Y Y Y Y Y Y@{}}
 		\toprule

 		
-		&& \multicolumn{3}{c}{\textbf{Model 1}}& \multicolumn{3}{c}{\textbf{Model 2}} & \multicolumn{3}{c}{\textbf{Model 3}}\\ 
-		& & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}& \multicolumn{3}{c}{response: \textit{is\_dup} = 1} & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}\\
-		 \cmidrule(r){3-5} \cmidrule(r){6-8} \cmidrule(r){9-11}
-		 && Coeffs. & Errors & Signif. & Coeffs. & Errors & Signif.& Coeffs. & Errors & Signif.\\
+		& \multicolumn{3}{c}{\textbf{Model 1}}& \multicolumn{3}{c}{\textbf{Model 2}} & \multicolumn{3}{c}{\textbf{Model 3}}\\ 
+		 & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}& \multicolumn{3}{c}{response: \textit{is\_dup} = 1} & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}\\
+		 \cmidrule(r){2-4} \cmidrule(r){5-7} \cmidrule(r){8-10}
+		 & Coeffs. & Errors & Signif. & Coeffs. & Errors & Signif.& Coeffs. & Errors & Signif.\\
+
+
+		 \midrule
+		 % \multicolumn{4}{@{}l}{\textbf{Project characteristics}}\\
+		 \texttt{open\_tasks}& & & & & & & & &\\
+		 \texttt{team\_size}& & & & & & & & &\\
+		\texttt{watchers}& & & & & & & & &\\

-		\midrule					
-		% \multicolumn{4}{@{}l}{\textbf{Patch characteristics}}\\
-		\multirow{3}{*}{Size}
-		&\texttt{commits} & & & & & & & & &\\
-		& \texttt{files} & & & & & & & & & \\
-		& \texttt{churn}& & & & & & & & &\\
-		\cdashline{1-11}[0.8pt/2pt]

-		\multirow{2}{*}{Text}
-		&\texttt{title\_len}& & & & & & & & &\\
-		& \texttt{desc\_len}& & & & & & & & &\\
-		\cdashline{1-11}[0.8pt/2pt]
-		Hotness &\texttt{hotness}& & & & & & & & &\\
-		\cdashline{1-11}[0.8pt/2pt]
-		Reference& \texttt{issue\_tag}& & & & & & & & &\\
-		\cdashline{1-4}[0.8pt/2pt]
-		\multirow{2}{*}{Type}
-		&\texttt{change\_type}& & & & & & & & &\\
-		&\texttt{activity\_type}& & & & & & & & &\\

 		\midrule
 		% \multicolumn{4}{@{}l}{\textbf{Submitter characteristics}}\\
-		% \cdashline{1-11}[0.8pt/2pt]
-		\multirow{4}{*}{Experience}
-		& \texttt{prev\_prs\_porj}& & & & & & & & &\\
-		& \texttt{prev\_prs}& & & & & & & & &\\
-		& \texttt{first\_pr\_proj}& & & & & & & & &\\
-		& \texttt{first\_pr}& & & & & & & & &\\
-		
-		\cdashline{1-11}[0.8pt/2pt]
-		\multirow{3}{*}{Status}
-		& \texttt{core\_team}& & & & & & & & &\\
-		& \texttt{followers}& & & & & & & & &\\
-		
-		\cdashline{1-11}[0.8pt/2pt]
-		\multirow{3}{*}{Connection}
-		& \texttt{followings}& & & & & & & & &\\
-		& \texttt{watch\_proj}& & & & & & & & &\\
-		& \texttt{prev\_interaction}& & & & & & & & &\\
+		\texttt{prev\_prs}& & & & & & & & &\\
+		\texttt{prev\_prs\_acc}& & & & & & & & &\\
+		\texttt{first\_pr\_proj}& & & & & & & & &\\
+		\texttt{core\_team}& & & & & & & & &\\
+		\texttt{followers}& & & & & & & & &\\
+		\texttt{prev\_interaction}& & & & & & & & &\\

-		\midrule
-		% \multicolumn{4}{@{}l}{\textbf{Project characteristics}}\\
-		% \cdashline{1-11}[0.8pt/2pt]
-		\multirow{2}{*}{Maturity}
-		& \texttt{proj\_age}& & & & & & & & &\\
-		& \texttt{prmodel\_age}& & & & & & & & &\\

-		\cdashline{1-11}[0.8pt/2pt]
-		\multirow{3}{*}{Wordload}
-		& \texttt{open\_issues}& & & & & & & & &\\
-		& \texttt{open\_prs}& & & & & & & & &\\
-		& \texttt{team\_size}& & & & & & & & &\\
+				 
+		\midrule					
+		% \multicolumn{4}{@{}l}{\textbf{Patch characteristics}}\\
+		\texttt{commits} & & & & & & & & &\\
+		\texttt{files} & & & & & & & & & \\
+		\texttt{churn}& & & & & & & & &\\
+
+		\texttt{title\_len}& & & & & & & & &\\
+		\texttt{desc\_len}& & & & & & & & &\\
+		\texttt{hotness}& & & & & & & & &\\
+		\texttt{issue\_tag}& & & & & & & & &\\
+		\texttt{change\_type}& & & & & & & & &\\
+		\texttt{activity\_type}& & & & & & & & &\\

-		\cdashline{1-11}[0.8pt/2pt]
-		\multirow{4}{*}{Popularity}
-		& \texttt{forks}& & & & & & & & &\\ 
-		& \texttt{watchers}& & & & & & & & &\\
-		& \texttt{prs}& & & & & & & & &\\
-		& \texttt{contributors}& & & & & & & & &\\

-		\cdashline{1-11}[0.8pt/2pt]
-		\multirow{4}{*}{Activeness}
-		& \texttt{forks\_3M}& & & & & & & & &\\
-		& \texttt{watchers\_3M}& & & & & & & & &\\
-		& \texttt{prs\_3M}& & & & & & & & &\\
-		& \texttt{contributors\_3M}& & & & & & & & &\\

 		\midrule

-		\multicolumn{2}{r}{Area Under the ROC Curve:} &\multicolumn{3}{c}{0.661}& \multicolumn{3}{c}{0.845} & \multicolumn{3}{c}{0.871}\\ 
+		\multicolumn{1}{r}{Area Under the ROC Curve:} &\multicolumn{3}{c}{0.661}& \multicolumn{3}{c}{0.845} & \multicolumn{3}{c}{0.871}\\ 

 		\bottomrule
 	\end{tabularx}
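
The submitter-level experience metrics described in the `4_rq2.tex` changes above (`prev_prs`, `prev_acc`, `prev_prs_proj`, `prev_acc_proj`, `first_pr`, `first_pr_proj`) can be illustrated with a minimal R sketch. It is not the authors' extraction code: the toy data frame, its column names (`submitter`, `project`, `created_at`, `merged`), and the helper `experience_metrics` are hypothetical. As a simplification, `merged` is taken as given, whereas the paper applies heuristics from prior studies to detect integration outside GitHub's merge button.

```r
# Hypothetical pull request history; column names are illustrative assumptions.
prs <- data.frame(
  submitter  = c("alice", "alice", "bob", "alice"),
  project    = c("p1", "p1", "p1", "p2"),
  created_at = as.Date(c("2020-01-01", "2020-02-01", "2020-02-15", "2020-03-01")),
  merged     = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Experience metrics for the pull request in row i, computed only from
# pull requests that the same developer submitted earlier.
experience_metrics <- function(prs, i) {
  earlier   <- prs$submitter == prs$submitter[i] &
               prs$created_at < prs$created_at[i]
  same_proj <- earlier & prs$project == prs$project[i]
  # Acceptance rate is undefined (NA) when there is no earlier pull request.
  rate <- function(mask) if (any(mask)) mean(prs$merged[mask]) else NA_real_
  list(
    prev_prs      = sum(earlier),
    prev_acc      = rate(earlier),
    prev_prs_proj = sum(same_proj),
    prev_acc_proj = rate(same_proj),
    first_pr      = sum(earlier) == 0,
    first_pr_proj = sum(same_proj) == 0
  )
}

m <- experience_metrics(prs, 4)
str(m)
```

The undefined acceptance rates (`NA`) produced for first-time submitters are the kind of missing values that the analysis scripts in this commit later replace with 0.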
--- a/5_discussion.tex
+++ b/5_discussion.tex
@ -78,7 +78,7 @@ if they can fulfil developers' actual demands in maintaining awareness.


 \vspace{0.5em}  
-\noindent \textbf{A mismatch between awareness information and actual outcomes/***.}
+\noindent \textbf{A mismatch between awareness *** actual information exchange.}
 Maintaining awareness is dual.
 Intuitively,
 it means developers need to \textit{gather external information} to stay aware of others' activities.
@ -90,7 +90,9 @@ This hinders other developers' ability to gather adequate contextual information
 Although prior work~\cite{Treude2010Awareness,Arora2016Supporting,Calefato2012Social} has extensively studied 
 how to help developers track work and get information,
 more research attention should be paid to encouraging developers to share awareness information.
-% Is the lack of sharing due to personality? Or due to working style?
+For example,
+it would be interesting to investigate 
+whether developers' willingness to share information is affected by the characteristics of collaboration mechanisms and communication tools.


 Obviously,
--- a/experiment_code/major_revision/rq2/rq2-yy.R
+++ b/experiment_code/major_revision/rq2/rq2-yy.R
@ -0,0 +1,127 @@
+setwd("~/Desktop")
+
+#########################################################
+# load data
+rq2_metrics <- read.csv("rq2_metrics.csv")
+summary(rq2_metrics)
+# replace NA values with 0 (indexing the column directly also works when no NA is present)
+rq2_data = rq2_metrics
+rq2_data$prev_pr_acc_proj[is.na(rq2_data$prev_pr_acc_proj)] = 0
+rq2_data$prev_prs_acc[is.na(rq2_data$prev_prs_acc)] = 0
+rq2_data$open_issues[is.na(rq2_data$open_issues)] = 0
+
+summary(rq2_data)
+
+# dropping rows outright when prev_prs_acc_proj is null may not be ideal
+rq2_m  <- rq2_data[complete.cases(rq2_data),]
+summary(rq2_m)
+#########################################################
+
+
+#########################################################
+# categorical metrics
+rq2_m$is_dup = as.logical(rq2_m$is_dup)
+rq2_m$file_type = as.factor(rq2_m$file_type)
+rq2_m$first_pr_proj = as.logical(rq2_m$first_pr_proj)
+rq2_m$first_pr = as.logical(rq2_m$first_pr)
+rq2_m$core_team = as.logical(rq2_m$core_team)
+rq2_m$watch_proj = as.logical(rq2_m$watch_proj)
+rq2_m$issue_tag = as.logical(rq2_m$issue_tag)
+rq2_m$prj_id = as.factor(rq2_m$prj_id)
+rq2_m$change_type = as.factor(rq2_m$change_type)
+
+#########################################################
+summary(rq2_m)
+
+rq2_m1 = subset(rq2_m,  loc<100000 & desc_len<10000)
+
+
+library(pROC)
+library(car)
+library(lme4)
+
+#step models----------------
+#proj level---------------
+gbg_1 = glmer(formula = is_dup ~ 
+                
+                log(open_tasks+0.5) 
+              + log(team_size+0.5) 
+              + log(watchers+0.5)
+              + log(hotness+0.5) 
+              
+              + (1|prj_id),
+              data=rq2_m1, 
+              family="binomial"
+)
+
+vif(gbg_1)
+summary(gbg_1)
+prob=predict(gbg_1, type=c("response"))
+rq2_m1$prob=prob
+a = roc(is_dup ~ prob, data = rq2_m1)
+a
+
+#proj level and submitter level---------------
+gbg_2 = glmer(formula = is_dup ~ 
+
+                log(open_tasks+0.5) 
+              + log(team_size+0.5) 
+              + log(watchers+0.5)
+              + log(hotness+0.5) 
+              
+              + log(prev_pullreqs+0.5)
+              + log(prev_prs_acc + 0.5)
+              + first_pr_proj
+              + log(followers+0.5)
+              + core_team 
+              + log(prior_interaction+0.5)
+              
+              
+              
+              + (1|prj_id),
+              data=rq2_m1, 
+              family="binomial"
+)
+vif(gbg_2)
+summary(gbg_2)
+prob=predict(gbg_2, type=c("response"))
+rq2_m1$prob=prob
+a = roc(is_dup ~ prob, data = rq2_m1)
+a
+
+
+#proj level, submitter level and PR level---------------
+gbg_3 = glmer(formula = is_dup ~ 
+                log(open_tasks+0.5) 
+              + log(team_size+0.5) 
+              + log(watchers+0.5)
+              + log(hotness+0.5) 
+              
+              + log(prev_pullreqs+0.5)
+              + log(prev_prs_acc + 0.5)
+              + first_pr_proj
+              + log(followers+0.5)
+              + core_team 
+              + log(prior_interaction+0.5)
+            
+              
+              + log(commits+0.5) + log(files_changed+0.5) + log(loc+0.5)
+              + log(title_len+0.5)
+              + log(desc_len + 0.5) 
+              + log(title_len+desc_len)
+              + issue_tag
+              + change_type
+              + file_type
+              
+              + (1|prj_id),
+              data=rq2_m1, 
+              family="binomial"
+)
+vif(gbg_3)
+summary(gbg_3)
+prob=predict(gbg_3, type=c("response"))
+rq2_m1$prob=prob
+a = roc(is_dup ~ prob, data = rq2_m1)
+a
+
+
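
The NA handling at the top of the script above can be sketched as a small helper. This is an illustrative sketch, not part of the commit: the function name `fill_na_zero` and the toy data frame are hypothetical, and only the column names `prev_prs_acc` and `open_issues` come from the script. Assigning through the column vector, rather than through a row subset of the data frame, is intended to behave the same whether or not a column contains any NA.

```r
# Replace NA with 0 in the named columns and return the modified data frame.
fill_na_zero <- function(data, cols) {
  for (col in cols) {
    data[[col]][is.na(data[[col]])] <- 0
  }
  data
}

# Hypothetical toy data: one column with an NA, one column without.
toy <- data.frame(
  prev_prs_acc = c(0.5, NA, 1.0),
  open_issues  = c(3, 4, 5)
)

filled <- fill_na_zero(toy, c("prev_prs_acc", "open_issues"))
print(filled)
```

Whether 0 is an appropriate stand-in for an undefined acceptance rate is a modelling decision that this sketch does not settle.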
--- a/experiment_code/major_revision/初步修订计划-english.docx
+++ b/experiment_code/major_revision/初步修订计划-english.docx
--- a/experiment_code/rq2/rq2-yy.R
+++ b/experiment_code/rq2/rq2-yy.R
@ -0,0 +1,127 @@
+setwd("~/Desktop")
+
+#########################################################
+# load data
+rq2_metrics <- read.csv("rq2_metrics.csv")
+summary(rq2_metrics)
+# replace NA values with 0 (indexing the column directly also works when no NA is present)
+rq2_data = rq2_metrics
+rq2_data$prev_pr_acc_proj[is.na(rq2_data$prev_pr_acc_proj)] = 0
+rq2_data$prev_prs_acc[is.na(rq2_data$prev_prs_acc)] = 0
+rq2_data$open_issues[is.na(rq2_data$open_issues)] = 0
+
+summary(rq2_data)
+
+# dropping rows outright when prev_prs_acc_proj is null may not be ideal
+rq2_m  <- rq2_data[complete.cases(rq2_data),]
+summary(rq2_m)
+#########################################################
+
+
+#########################################################
+# categorical metrics
+rq2_m$is_dup = as.logical(rq2_m$is_dup)
+#rq2_m$pr_type = as.factor(rq2_m$pr_type)
+rq2_m$file_type = as.factor(rq2_m$file_type)
+rq2_m$first_pr_proj = as.logical(rq2_m$first_pr_proj)
+rq2_m$first_pr = as.logical(rq2_m$first_pr)
+rq2_m$core_team = as.logical(rq2_m$core_team)
+rq2_m$watch_proj = as.logical(rq2_m$watch_proj)
+rq2_m$issue_tag = as.logical(rq2_m$issue_tag)
+rq2_m$prj_id = as.factor(rq2_m$prj_id)
+rq2_m$change_type = as.factor(rq2_m$change_type)
+
+#########################################################
+summary(rq2_m)
+
+rq2_m1 = subset(rq2_m,  loc<100000 & desc_len<10000)
+
+
+library(pROC)
+library(car)
+library(lme4)
+
+#step models----------------
+#proj level---------------
+gbg_1 = glmer(formula = is_dup ~ 
+                log(proj_age+0.5) 
+              + log(open_tasks+0.5) 
+              + log(team_size+0.5) 
+              + log(watchers+0.5)            
+              
+              + (1|prj_id),
+              data=rq2_m1, 
+              family="binomial"
+)
+
+vif(gbg_1)
+summary(gbg_1)
+prob=predict(gbg_1, type=c("response"))
+rq2_m1$prob=prob
+a = roc(is_dup ~ prob, data = rq2_m1)
+a
+
+#proj level and submitter level---------------
+gbg_2 = glmer(formula = is_dup ~ 
+                log(proj_age+0.5) 
+              + log(open_tasks+0.5) 
+              + log(team_size+0.5) 
+              + log(watchers+0.5)
+              #+ log(forks_3M+0.5) 
+              #+ log(watchers_3M+0.5) 
+              #+ log(pullreqs_3M+0.5)
+              
+              + log(prev_pullreqs+0.5)
+              + first_pr_proj
+              #+ log(prev_pullreqs_proj + 0.5)
+              + log(prev_pr_acc_proj + 0.5)
+              + log(followers+0.5)
+              + core_team 
+              + log(prior_interaction+0.5)
+              
+              
+              + (1|prj_id),
+              data=rq2_m1, 
+              family="binomial"
+)
+vif(gbg_2)
+summary(gbg_2)
+prob=predict(gbg_2, type=c("response"))
+rq2_m1$prob=prob
+a = roc(is_dup ~ prob, data = rq2_m1)
+a
+
+
+#proj level, submitter level and PR level---------------
+gbg_3 = glmer(formula = is_dup ~ 
+                log(proj_age+0.5) 
+              + log(open_tasks+0.5) 
+              + log(team_size+0.5) 
+              + log(watchers+0.5)
+              
+              + log(prev_pullreqs+0.5)
+              + log(prev_prs_acc + 0.5)
+              + first_pr_proj
+              + log(followers+0.5)
+              + core_team 
+              + log(prior_interaction+0.5)
+              
+              + log(hotness+0.5) 
+              + log(commits+0.5) + log(files_changed+0.5) + log(loc+0.5)
+              + log(title_len+desc_len) 
+              + issue_tag
+              + change_type
+              + file_type
+              
+              + (1|prj_id),
+              data=rq2_m1, 
+              family="binomial"
+)
+vif(gbg_3)
+summary(gbg_3)
+prob=predict(gbg_3, type=c("response"))
+rq2_m1$prob=prob
+a = roc(is_dup ~ prob, data = rq2_m1)
+a
+
+
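
Each model in the scripts above is summarized by the area under the ROC curve, computed with `roc` from pROC. As a rough illustration of what that number means, the base-R sketch below computes the AUC from ranks: the probability that a randomly chosen duplicate receives a higher predicted probability than a randomly chosen non-duplicate, with ties counted as one half. The function name `auc_rank` and the toy vectors are hypothetical. pROC selects the comparison direction automatically by default, so its result may differ from this sketch when scores are inversely related to the outcome.

```r
# Rank-based AUC (equivalent to the normalized Mann-Whitney U statistic).
auc_rank <- function(labels, scores) {
  labels <- as.logical(labels)
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # tied scores receive average ranks
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical outcomes and predicted probabilities.
is_dup <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
prob   <- c(0.9, 0.2, 0.6, 0.7, 0.1)

auc <- auc_rank(is_dup, prob)
print(auc)
```

In this toy example, five of the six duplicate/non-duplicate pairs are ordered correctly, so the AUC is 5/6.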