minor refinement

whystar 2020-06-10 19:36:05 +08:00
parent eeb3d63b80
commit 640a64eb2a
5 changed files with 431 additions and 198 deletions

369
4_rq2.tex

@ -568,6 +568,95 @@ we identify \hl{**} metrics that can be computed at pull request submission t
The identified metrics are classified into the following three categories:
\vspace{0.5em}
\noindent\textbf{Project-level characteristics.}
\vspace{0.2em}
\textit{Maturity.}
Previous studies used the metric \texttt{proj\_age},
\ie the period of time from the time the project was hosted on GitHub to the pull request submission time,
as an indicator of the project maturity~\cite{Tsay2014Influence,yu16det,Rahman2014An}.
However,
a project does not necessarily adopt the pull request model from the outset.
We also use the metric \texttt{prmodel\_age} to indicate
how long a project has adopted the pull request development model.
\vspace{0.2em}
\textit{Workload.}
The discussion of issues and pull requests might take days to months to conclude.
At any given time,
many open issues and pull requests
might be discussed simultaneously.
Prior studies have characterized project integrators' workload using two metrics:
\texttt{open\_tasks}~\cite{yu16det} and \texttt{team\_size}~\cite{Tsay2014Influence,Gousios:2014,yu16det},
which are the number of open issues and open pull requests at the pull request submission and
the number of active core team members during the last three months, respectively.
\vspace{0.2em}
\textit{Popularity.}
In measuring project popularity,
the metric \texttt{stars},
\ie the number of stars the project has received,
was commonly used in prior studies~\cite{Bor17Und,Tsay2014Influence}.
In addition,
we also considered three other popularity-related metrics:
\texttt{forks}, \texttt{prs}, and \texttt{contributors},
which are the number of forks, the number of pull requests, and the number of contributors
of the project, respectively.
% \vspace{0.2em}
% \textit{Activeness.}
% We further measure the activeness of the project
% by including four metrics:
% \texttt{stars\_3M}, \texttt{forks\_3M}, \texttt{prs\_3M}, and \texttt{contributors\_3M},
% which are the new stars, new forks, new pull requests, and active contributors in the project in the last three months, respectively.
\vspace{0.5em}
\noindent\textbf{Submitter-level characteristics.}
\vspace{0.2em}
\textit{Experience.}
Developers' experience before they submit the pull request has been analyzed in prior studies~\cite{Gousios:2014,jiang2013will}.
This measure can be computed from two perspectives:
project-level experience and community-level experience.
The former measures the number of previous pull requests
that have been submitted to a specific project (\texttt{prev\_prs\_proj}) and their acceptance rate (\texttt{prev\_acc\_proj}).
The latter measures the number of previous pull requests
that have been submitted to GitHub (\texttt{prev\_prs}) and their acceptance rate (\texttt{prev\_acc}).
When calculating acceptance rate,
the determination of whether the pull request was integrated
through other mechanisms than GitHub's merge button follows the heuristics defined in previous studies~\cite{Gousios:2014,zhou2019fork}.
We also use two metrics \texttt{first\_pr\_proj} and
\texttt{first\_pr} to represent whether the pull request is the first one submitted by a developer to a specific project and GitHub, respectively.
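As an illustrative formalization (the symbols $N_d$ and $M_d$ are shorthand introduced here, not notation taken from the cited studies),
let $N_d$ be the number of pull requests that developer $d$ submitted before the current one,
and let $M_d \le N_d$ be the number of those that were accepted.
The acceptance rate can then be sketched as
\[
\texttt{prev\_acc} = \frac{M_d}{N_d}, \qquad N_d > 0,
\]
and analogously for \texttt{prev\_acc\_proj} when both counts are restricted to a specific project.
The ratio is undefined for $N_d = 0$, \ie when \texttt{first\_pr} (or \texttt{first\_pr\_proj}) is true.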
\vspace{0.2em}
\textit{Standing.}
A dichotomous metric \texttt{core\_team},
which indicates whether the pull request submitter is a core team member of the project,
was commonly used as a signal of the developer's standing within the project~\cite{Tsay2014Influence,yu16det}.
Furthermore,
a continuous metric \texttt{followers},
\ie the number of GitHub users that are following the pull request submitter,
was used to represent the developers' standing within the community~\cite{Tsay2014Influence,Gousios:2014,yu16det}.
\vspace{0.2em}
\textit{Previous interaction.}
This metric (\texttt{prev\_interaction}) is the total number of events,
\eg commenting on issues and pull requests,
prior to the pull request submission
that the developer has participated in within the project~\cite{Tsay2014Influence,yu16det}.
\vspace{0.5em}
\noindent\textbf{Patch-level characteristics.}
@ -621,89 +710,6 @@ and \textit{Doc} changing documentation files.
This metric (\texttt{activity\_type}) is determined by checking the names and extensions of changed files.
\vspace{0.5em}
\noindent\textbf{Submitter-level characteristics.}
\vspace{0.2em}
\textit{Experience.}
Developers' experience before they submit the pull request has been analyzed in prior studies~\cite{Gousios:2014,jiang2013will}.
This measure can be computed from two perspectives:
project-level experience and community-level experience.
The former measures the number of previous pull requests
that have been submitted to a specific project (\texttt{prev\_prs\_proj}) and their acceptance rate (\texttt{prev\_acc\_proj}).
The latter measures the number of previous pull requests
that have been submitted to GitHub (\texttt{prev\_prs}) and their acceptance rate (\texttt{prev\_acc}).
When calculating acceptance rate,
the determination of whether the pull request was integrated
through other mechanisms than GitHub's merge button follows the heuristics defined in previous studies~\cite{Gousios:2014,zhou2019fork}.
We also use two metrics \texttt{first\_pr\_proj} and
\texttt{first\_pr} to represent whether the pull request is the first one submitted by a developer to a specific project and GitHub, respectively.
\vspace{0.2em}
\textit{Standing.}
A dichotomous metric \texttt{core\_team},
which indicates whether the pull request submitter is a core team member of the project,
was commonly used as a signal of the developer's standing within the project~\cite{Tsay2014Influence,yu16det}.
Furthermore,
a continuous metric \texttt{followers},
\ie the number of GitHub users that are following the pull request submitter,
was used to represent the developers' standing within the community~\cite{Tsay2014Influence,Gousios:2014,yu16det}.
\vspace{0.2em}
\textit{Previous interaction.}
This metric (\texttt{prev\_interaction}) is the total number of events,
\eg commenting on issues and pull requests,
prior to the pull request submission
that the developer has participated in within the project~\cite{Tsay2014Influence,yu16det}.
\vspace{0.5em}
\noindent\textbf{Project-level characteristics.}
\vspace{0.2em}
\textit{Maturity.}
Previous studies used the metric \texttt{proj\_age},
\ie the period of time from the time the project was hosted on GitHub to the pull request submission time,
as an indicator of the project maturity~\cite{Tsay2014Influence,yu16det,Rahman2014An}.
However,
a project does not necessarily adopt the pull request model from the outset.
We also use the metric \texttt{prmodel\_age} to indicate
how long a project has adopted the pull request development model.
\vspace{0.2em}
\textit{Workload.}
The discussion of issues and pull requests might take days to months to conclude.
At any given time,
many open issues and pull requests
might be discussed simultaneously.
Prior studies have characterized project integrators' workload using two metrics:
\texttt{open\_tasks}~\cite{yu16det} and \texttt{team\_size}~\cite{Tsay2014Influence,Gousios:2014,yu16det},
which are the number of open issues and open pull requests at the pull request submission and
the number of active core team members during the last three months, respectively.
\vspace{0.2em}
\textit{Popularity.}
In measuring project popularity,
the metric \texttt{stars},
\ie the number of stars the project has received,
was commonly used in prior studies~\cite{Bor17Und,Tsay2014Influence}.
In addition,
we also considered three other popularity-related metrics:
\texttt{forks}, \texttt{prs}, and \texttt{contributors},
which are the number of forks, the number of pull requests, and the number of contributors
of the project, respectively.
\vspace{0.2em}
\textit{Activeness.}
We further measure the activeness of the project
by including four metrics:
\texttt{stars\_3M}, \texttt{forks\_3M}, \texttt{prs\_3M}, and \texttt{contributors\_3M},
which are the new stars, new forks, new pull requests, and active contributors in the project in the last three months, respectively.
\subsubsection{Comparative exploration}
@ -732,8 +738,8 @@ $H_{0}$: duplicate pull requests exhibit a value of metric $m$ equal to that one
\texttt{proj\_age}, \texttt{prmodel\_age},
\hl{\texttt{open\_tasks}}
\texttt{open\_issues}, \texttt{open\_prs}, \texttt{team\_size},
\texttt{forks}, \texttt{stars}, \texttt{prs}, \texttt{contributors},
\texttt{forks\_3M}, \texttt{stars\_3M}, \texttt{prs\_3M}, \texttt{contributors\_3M}\}
\texttt{forks}, \texttt{stars}, \texttt{prs}, \texttt{contributors}
\}
\vspace{0.2em}
@ -761,6 +767,56 @@ are different in terms of all metrics except for that they have similar number o
\toprule
\multicolumn{2}{r}{\textbf{Metric}} &\tabincell{c}{\textbf{Effect}\\\textbf{size}} &\tabincell{c}{\textbf{\textit{Adjusted}}\\\textbf{\textit{p-value}}}\\
\midrule
\multicolumn{4}{@{}l}{\textbf{Project-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{2}{*}{Maturity}
& \texttt{proj\_age}&0.166 & 2.95e-21 ***\\
& \texttt{prmodel\_age}& -0.108 &2.34e-11 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{2}{*}{Workload}
& \texttt{open\_tasks}&0.177 &2.00e-37 ***\\
& \texttt{team\_size}& 0.257& 1.98e-41 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{4}{*}{Popularity}
& \texttt{forks}& -0.327& 3.06e-61 ***\\
& \texttt{watchers}& -0.289& 9.77e-47 ***\\
& \texttt{prs}& 0.187& 5.34e-19 ***\\
& \texttt{contributors}& -0.238& 3.71e-49 ***\\
% \cdashline{1-4}[0.8pt/2pt]
% \multirow{4}{*}{Activeness}
% & \texttt{forks\_3M}& -0.315& 6.39e-51 ***\\
% & \texttt{watchers\_3M}& -0.178& 2.07e-12 ***\\
% & \texttt{prs\_3M}& 0.280& 7.11e-85 ***\\
% & \texttt{contributors\_3M}& -0.115 & 2.31e-07 ***\\
\midrule
\multicolumn{4}{@{}l}{\textbf{Submitter-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{4}{*}{Experience}
& \texttt{prev\_prs\_proj}&0.281 & 5.46e-192 ***\\
& \texttt{prev\_prs}& 0.227& 1.82e-94 ***\\
& \texttt{first\_pr\_proj}& -0.435 & 3.21e-149 ***\\
& \texttt{first\_pr}& -0.200 & 3.91e-33 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{2}{*}{Standing}
& \texttt{core\_team}& 0.385&2.40e-117 ***\\
& \texttt{followers}&-0.002 & 8.78e-21 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{1}{*}{Interaction}
% & \texttt{followings}& -0.012 &3.21e-06 ***\\
% & \texttt{watch\_proj}&0.124 &1.28e-13 ***\\
& \texttt{prev\_interaction}&0.163 & 9.81e-108 ***\\
\midrule
\multicolumn{4}{@{}l}{\textbf{Patch-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]
@ -781,54 +837,7 @@ are different in terms of all metrics except for that they have similar number o
&\texttt{change\_type}& 0.055& 9.73e-4 ***\\
&\texttt{activity\_type}& 0.055& 9.73e-4 ***\\
\midrule
\multicolumn{4}{@{}l}{\textbf{Submitter-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{4}{*}{Experience}
& \texttt{prev\_prs\_proj}&0.281 & 5.46e-192 ***\\
& \texttt{prev\_prs}& 0.227& 1.82e-94 ***\\
& \texttt{first\_pr\_proj}& -0.435 & 3.21e-149 ***\\
& \texttt{first\_pr}& -0.200 & 3.91e-33 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{2}{*}{Standing}
& \texttt{core\_team}& 0.385&2.40e-117 ***\\
& \texttt{followers}&-0.002 & 8.78e-21 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{3}{*}{Connection}
& \texttt{followings}& -0.012 &3.21e-06 ***\\
& \texttt{watch\_proj}&0.124 &1.28e-13 ***\\
& \texttt{prev\_interaction}&0.163 & 9.81e-108 ***\\
\midrule
\multicolumn{4}{@{}l}{\textbf{Project-level characteristics}}\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{2}{*}{Maturity}
& \texttt{proj\_age}&0.166 & 2.95e-21 ***\\
& \texttt{prmodel\_age}& -0.108 &2.34e-11 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{3}{*}{Workload}
& \texttt{open\_issues}&0.177 &2.00e-37 ***\\
& \texttt{open\_prs}& 0.011& 4.71e-32 ***\\
& \texttt{team\_size}& 0.257& 1.98e-41 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{4}{*}{Popularity}
& \texttt{forks}& -0.327& 3.06e-61 ***\\
& \texttt{watchers}& -0.289& 9.77e-47 ***\\
& \texttt{prs}& 0.187& 5.34e-19 ***\\
& \texttt{contributors}& -0.238& 3.71e-49 ***\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{4}{*}{Activeness}
& \texttt{forks\_3M}& -0.315& 6.39e-51 ***\\
& \texttt{watchers\_3M}& -0.178& 2.07e-12 ***\\
& \texttt{prs\_3M}& 0.280& 7.11e-85 ***\\
& \texttt{contributors\_3M}& -0.115 & 2.31e-07 ***\\
\bottomrule
\end{tabularx}
@ -924,85 +933,53 @@ For project-level metrics,
\renewcommand{\arraystretch}{1.15}
\centering
\caption{\color{red}{Statistical models for the likelihood of duplicate pull requests}}
\begin{tabularx}{\textwidth}{@{}l r Y Y Y Y Y Y Y Y Y@{}}
\begin{tabularx}{\textwidth}{@{}r Y Y Y Y Y Y Y Y Y@{}}
\toprule
&& \multicolumn{3}{c}{\textbf{Model 1}}& \multicolumn{3}{c}{\textbf{Model 2}} & \multicolumn{3}{c}{\textbf{Model 3}}\\
& & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}& \multicolumn{3}{c}{response: \textit{is\_dup} = 1} & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}\\
\cmidrule(r){3-5} \cmidrule(r){6-8} \cmidrule(r){9-11}
&& Coeffs. & Errors & Signif. & Coeffs. & Errors & Signif.& Coeffs. & Errors & Signif.\\
& \multicolumn{3}{c}{\textbf{Model 1}}& \multicolumn{3}{c}{\textbf{Model 2}} & \multicolumn{3}{c}{\textbf{Model 3}}\\
& \multicolumn{3}{c}{response: \textit{is\_dup} = 1}& \multicolumn{3}{c}{response: \textit{is\_dup} = 1} & \multicolumn{3}{c}{response: \textit{is\_dup} = 1}\\
\cmidrule(r){2-4} \cmidrule(r){5-7} \cmidrule(r){8-10}
& Coeffs. & Errors & Signif. & Coeffs. & Errors & Signif.& Coeffs. & Errors & Signif.\\
\midrule
% \multicolumn{4}{@{}l}{\textbf{Project characteristics}}\\
\texttt{open\_tasks}& & & & & & & & &\\
\texttt{team\_size}& & & & & & & & &\\
\texttt{watchers}& & & & & & & & &\\
\midrule
% \multicolumn{4}{@{}l}{\textbf{Patch characteristics}}\\
\multirow{3}{*}{Size}
&\texttt{commits} & & & & & & & & &\\
& \texttt{files} & & & & & & & & & \\
& \texttt{churn}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
\multirow{2}{*}{Text}
&\texttt{title\_len}& & & & & & & & &\\
& \texttt{desc\_len}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
Hotness &\texttt{hotness}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
Reference& \texttt{issue\_tag}& & & & & & & & &\\
\cdashline{1-4}[0.8pt/2pt]
\multirow{2}{*}{Type}
&\texttt{change\_type}& & & & & & & & &\\
&\texttt{activity\_type}& & & & & & & & &\\
\midrule
% \multicolumn{4}{@{}l}{\textbf{Submitter characteristics}}\\
% \cdashline{1-11}[0.8pt/2pt]
\multirow{4}{*}{Experience}
& \texttt{prev\_prs\_proj}& & & & & & & & &\\
& \texttt{prev\_prs}& & & & & & & & &\\
& \texttt{first\_pr\_proj}& & & & & & & & &\\
& \texttt{first\_pr}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
\multirow{3}{*}{Status}
& \texttt{core\_team}& & & & & & & & &\\
& \texttt{followers}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
\multirow{3}{*}{Connection}
& \texttt{followings}& & & & & & & & &\\
& \texttt{watch\_proj}& & & & & & & & &\\
& \texttt{prev\_interaction}& & & & & & & & &\\
\texttt{prev\_prs}& & & & & & & & &\\
\texttt{prev\_prs\_acc}& & & & & & & & &\\
\texttt{first\_pr\_proj}& & & & & & & & &\\
\texttt{core\_team}& & & & & & & & &\\
\texttt{followers}& & & & & & & & &\\
\texttt{prev\_interaction}& & & & & & & & &\\
\midrule
% \multicolumn{4}{@{}l}{\textbf{Project characteristics}}\\
% \cdashline{1-11}[0.8pt/2pt]
\multirow{2}{*}{Maturity}
& \texttt{proj\_age}& & & & & & & & &\\
& \texttt{prmodel\_age}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
\multirow{3}{*}{Workload}
& \texttt{open\_issues}& & & & & & & & &\\
& \texttt{open\_prs}& & & & & & & & &\\
& \texttt{team\_size}& & & & & & & & &\\
\midrule
% \multicolumn{4}{@{}l}{\textbf{Patch characteristics}}\\
\texttt{commits} & & & & & & & & &\\
\texttt{files} & & & & & & & & & \\
\texttt{churn}& & & & & & & & &\\
\texttt{title\_len}& & & & & & & & &\\
\texttt{desc\_len}& & & & & & & & &\\
\texttt{hotness}& & & & & & & & &\\
\texttt{issue\_tag}& & & & & & & & &\\
\texttt{change\_type}& & & & & & & & &\\
\texttt{activity\_type}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
\multirow{4}{*}{Popularity}
& \texttt{forks}& & & & & & & & &\\
& \texttt{watchers}& & & & & & & & &\\
& \texttt{prs}& & & & & & & & &\\
& \texttt{contributors}& & & & & & & & &\\
\cdashline{1-11}[0.8pt/2pt]
\multirow{4}{*}{Activeness}
& \texttt{forks\_3M}& & & & & & & & &\\
& \texttt{watchers\_3M}& & & & & & & & &\\
& \texttt{prs\_3M}& & & & & & & & &\\
& \texttt{contributors\_3M}& & & & & & & & &\\
\midrule
\multicolumn{2}{r}{Area Under the ROC Curve:} &\multicolumn{3}{c}{0.661}& \multicolumn{3}{c}{0.845} & \multicolumn{3}{c}{0.871}\\
\multicolumn{1}{r}{Area Under the ROC Curve:} &\multicolumn{3}{c}{0.661}& \multicolumn{3}{c}{0.845} & \multicolumn{3}{c}{0.871}\\
\bottomrule
\end{tabularx}


@ -78,7 +78,7 @@ if they can fulfil developers' actual demands in maintaining awareness.
\vspace{0.5em}
\noindent \textbf{A mismatch between awareness information and actual outcomes/***.}
\noindent \textbf{A mismatch between awareness *** actual information exchange.}
Maintaining awareness is dual.
Intuitively,
it means developers need to \textit{gather external information} to stay aware of others' activities.
@ -90,7 +90,9 @@ This hinders other developers' ability to gather adequate contextual information
Although prior work~\cite{Treude2010Awareness,Arora2016Supporting,Calefato2012Social} has extensively studied
how to help developers track work and get information,
more research attention should be paid to encouraging developers to share awareness information.
% Is the lack of sharing due to personality, or due to working style?
For example,
it would be interesting to investigate
whether developers' willingness to share information is affected by the characteristics of collaboration mechanisms and communication tools.
Obviously,


@ -0,0 +1,127 @@
setwd("~/Desktop")
#########################################################
# load data
rq2_metrics <- read.csv("rq2_metrics.csv")
summary(rq2_metrics)
# remove NA data
rq2_data = rq2_metrics
# replace missing values with 0; indexing the column vector also works when no value is NA
rq2_data$prev_pr_acc_proj[is.na(rq2_data$prev_pr_acc_proj)] = 0
rq2_data$prev_prs_acc[is.na(rq2_data$prev_prs_acc)] = 0
rq2_data$open_issues[is.na(rq2_data$open_issues)] = 0
summary(rq2_data)
# Is it inappropriate to simply drop the rows where prev_prs_acc_proj is null?
rq2_m <- rq2_data[complete.cases(rq2_data),]
summary(rq2_m)
#########################################################
#########################################################
# categorical metrics
rq2_m$is_dup = as.logical(rq2_m$is_dup)
rq2_m$file_type = as.factor(rq2_m$file_type)
rq2_m$first_pr_proj = as.logical(rq2_m$first_pr_proj)
rq2_m$first_pr = as.logical(rq2_m$first_pr)
rq2_m$core_team = as.logical(rq2_m$core_team)
rq2_m$watch_proj = as.logical(rq2_m$watch_proj)
rq2_m$issue_tag = as.logical(rq2_m$issue_tag)
rq2_m$prj_id = as.factor(rq2_m$prj_id)
rq2_m$change_type = as.factor(rq2_m$change_type)
#########################################################
summary(rq2_m)
rq2_m1 = subset(rq2_m, loc<100000 & desc_len<10000)
library(pROC)
library(car)
library(lme4)
#step models----------------
#proj level---------------
gbg_1 = glmer(formula = is_dup ~
log(open_tasks+0.5)
+ log(team_size+0.5)
+ log(watchers+0.5)
+ log(hotness+0.5)
+ (1|prj_id),
data=rq2_m1,
family="binomial"
)
vif(gbg_1)
summary(gbg_1)
prob=predict(gbg_1, type=c("response"))
rq2_m1$prob=prob
a = roc(is_dup ~ prob, data = rq2_m1)
a
#proj level and submitter level---------------
gbg_2 = glmer(formula = is_dup ~
log(open_tasks+0.5)
+ log(team_size+0.5)
+ log(watchers+0.5)
+ log(hotness+0.5)
+ log(prev_pullreqs+0.5)
+ log(prev_prs_acc + 0.5)
+ first_pr_proj
+ log(followers+0.5)
+ core_team
+ log(prior_interaction+0.5)
+ (1|prj_id),
data=rq2_m1,
family="binomial"
)
vif(gbg_2)
summary(gbg_2)
prob=predict(gbg_2, type=c("response"))
rq2_m1$prob=prob
a = roc(is_dup ~ prob, data = rq2_m1)
a
#proj level, submitter level and PR level---------------
gbg_3 = glmer(formula = is_dup ~
log(open_tasks+0.5)
+ log(team_size+0.5)
+ log(watchers+0.5)
+ log(hotness+0.5)
+ log(prev_pullreqs+0.5)
+ log(prev_prs_acc + 0.5)
+ first_pr_proj
+ log(followers+0.5)
+ core_team
+ log(prior_interaction+0.5)
+ log(commits+0.5) + log(files_changed+0.5) + log(loc+0.5)
+ log(title_len+0.5)
+ log(desc_len + 0.5)
+ log(title_len+desc_len)
+ issue_tag
+ change_type
+ file_type
+ (1|prj_id),
data=rq2_m1,
family="binomial"
)
vif(gbg_3)
summary(gbg_3)
prob=predict(gbg_3, type=c("response"))
rq2_m1$prob=prob
a = roc(is_dup ~ prob, data = rq2_m1)
a
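# ---------------------------------------------------------------
# Illustrative sketch, not part of the original analysis.
# The three sections above repeat the same fit -> predict -> ROC steps.
# Assuming the same packages (lme4, pROC), those steps could be wrapped
# in one helper. The name `fit_and_auc` and the simulated `toy` data are
# hypothetical and exist only for this example.
library(lme4)
library(pROC)

fit_and_auc <- function(formula, data) {
  # Mixed-effects logistic regression, as in gbg_1 .. gbg_3
  model <- lme4::glmer(formula, data = data, family = "binomial")
  # Fitted probabilities (including the per-project random intercept)
  prob <- as.numeric(predict(model, type = "response"))
  # The response is the first variable named in the formula
  response_name <- all.vars(formula)[1]
  roc_obj <- pROC::roc(response = data[[response_name]],
                       predictor = prob, quiet = TRUE)
  list(model = model, auc = as.numeric(pROC::auc(roc_obj)))
}

# Simulated data with the same column names as the script (assumption:
# one row per pull request, no missing values)
set.seed(1)
n <- 600
n_proj <- 20
toy <- data.frame(
  prj_id     = factor(sample(seq_len(n_proj), n, replace = TRUE)),
  open_tasks = rpois(n, 30),
  team_size  = rpois(n, 5)
)
proj_effect <- rnorm(n_proj, sd = 1)
lin_pred <- -1 + 0.5 * log(toy$open_tasks + 0.5) +
  proj_effect[as.integer(toy$prj_id)]
toy$is_dup <- rbinom(n, 1, plogis(lin_pred))

res <- fit_and_auc(
  is_dup ~ log(open_tasks + 0.5) + log(team_size + 0.5) + (1 | prj_id),
  toy
)
res$auc  # in-sample AUC, so it is an optimistic estimate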


@ -0,0 +1,127 @@
setwd("~/Desktop")
#########################################################
# load data
rq2_metrics <- read.csv("rq2_metrics.csv")
summary(rq2_metrics)
# remove NA data
rq2_data = rq2_metrics
# replace missing values with 0; indexing the column vector also works when no value is NA
rq2_data$prev_pr_acc_proj[is.na(rq2_data$prev_pr_acc_proj)] = 0
rq2_data$prev_prs_acc[is.na(rq2_data$prev_prs_acc)] = 0
rq2_data$open_issues[is.na(rq2_data$open_issues)] = 0
summary(rq2_data)
# Is it inappropriate to simply drop the rows where prev_prs_acc_proj is null?
rq2_m <- rq2_data[complete.cases(rq2_data),]
summary(rq2_m)
#########################################################
#########################################################
# categorical metrics
rq2_m$is_dup = as.logical(rq2_m$is_dup)
#rq2_m$pr_type = as.factor(rq2_m$pr_type)
rq2_m$file_type = as.factor(rq2_m$file_type)
rq2_m$first_pr_proj = as.logical(rq2_m$first_pr_proj)
rq2_m$first_pr = as.logical(rq2_m$first_pr)
rq2_m$core_team = as.logical(rq2_m$core_team)
rq2_m$watch_proj = as.logical(rq2_m$watch_proj)
rq2_m$issue_tag = as.logical(rq2_m$issue_tag)
rq2_m$prj_id = as.factor(rq2_m$prj_id)
rq2_m$change_type = as.factor(rq2_m$change_type)
#########################################################
summary(rq2_m)
rq2_m1 = subset(rq2_m, loc<100000 & desc_len<10000)
library(pROC)
library(car)
library(lme4)
#step models----------------
#proj level---------------
gbg_1 = glmer(formula = is_dup ~
log(proj_age+0.5)
+ log(open_tasks+0.5)
+ log(team_size+0.5)
+ log(watchers+0.5)
+ (1|prj_id),
data=rq2_m1,
family="binomial"
)
vif(gbg_1)
summary(gbg_1)
prob=predict(gbg_1, type=c("response"))
rq2_m1$prob=prob
a = roc(is_dup ~ prob, data = rq2_m1)
a
#proj level and submitter level---------------
gbg_2 = glmer(formula = is_dup ~
log(proj_age+0.5)
+ log(open_tasks+0.5)
+ log(team_size+0.5)
+ log(watchers+0.5)
#+ log(forks_3M+0.5)
#+ log(watchers_3M+0.5)
#+ log(pullreqs_3M+0.5)
+ log(prev_pullreqs+0.5)
+ first_pr_proj
#+ log(prev_pullreqs_proj + 0.5)
+ log(prev_pr_acc_proj + 0.5)
+ log(followers+0.5)
+ core_team
+ log(prior_interaction+0.5)
+ (1|prj_id),
data=rq2_m1,
family="binomial"
)
vif(gbg_2)
summary(gbg_2)
prob=predict(gbg_2, type=c("response"))
rq2_m1$prob=prob
a = roc(is_dup ~ prob, data = rq2_m1)
a
#proj level, submitter level and PR level---------------
gbg_3 = glmer(formula = is_dup ~
log(proj_age+0.5)
+ log(open_tasks+0.5)
+ log(team_size+0.5)
+ log(watchers+0.5)
+ log(prev_pullreqs+0.5)
+ log(prev_prs_acc + 0.5)
+ first_pr_proj
+ log(followers+0.5)
+ core_team
+ log(prior_interaction+0.5)
+ log(hotness+0.5)
+ log(commits+0.5) + log(files_changed+0.5) + log(loc+0.5)
+ log(title_len+desc_len)
+ issue_tag
+ change_type
+ file_type
+ (1|prj_id),
data=rq2_m1,
family="binomial"
)
vif(gbg_3)
summary(gbg_3)
prob=predict(gbg_3, type=c("response"))
rq2_m1$prob=prob
a = roc(is_dup ~ prob, data = rq2_m1)
a