\section{Discussion}
\label{sec:ds}
Based on our analysis results and findings,
we now discuss recommendations and implications
for OSS practitioners.
{\color{hltext}
\subsection{Main findings}
% coordination
% drive-by pr (not have the patience to check)
\subsubsection{Awareness breakdown}
Maintaining awareness in globally distributed development is a significant concern~\cite{Treude2010Awareness,Gutwin2004Group}.
Developers need to pursue \textit{``an understanding of the activities of others, which provides a context for your own activity''}~\cite{dourish1992awareness}.
As the most popular collaborative development platform,
GitHub has centralized information about project statuses and developer activities
and made them transparent and visible~\cite{Dabbish2012Social},
which helps developers maintain awareness with less effort~\cite{Kalliamvakou2015Open}.
However,
awareness breakdown still occurs and results in duplicate work.
Our analysis of the specific context where duplicates are produced,
as shown in Section~\ref{ss:contextanalysis},
reveals
three mismatches leading to awareness breakdown.
\vspace{0.5em}
\noindent \textbf{A mismatch between awareness requirements and actual activities.}
In most community-based OSS projects,
developers are self-organized~\cite{Crowston2005Coordination,bird2008latent},
and are allowed to work on any part of the code according to their individual interests and time~\cite{Gutwin2004Group,lakhani2003hackers}.
Awareness requirements arise
whenever a developer works on an issue
because other developers might self-assign themselves to the same issue.
However,
our findings show that
some contributors do not invest sufficient effort in awareness activities
({\footnotesize Section~\ref{ss:contextanalysis}: \textit{\textbf{Not searching for existing work}}, \textit{\textbf{Overlooking linked pull requests}}, and \textit{\textbf{Overlooking existing claims}}}).
We assume that this is due to the volunteer nature of OSS participation.
For some developers,
especially casual contributors and one-time contributors,
a major motivation to make contributions is to \textit{``scratch their own itch''}~\cite{pinto2016more,lee2017understanding}.
When they encounter a problem,
they code a patch to fix it and send the patch back to the community.
Some of them do not even care about the final outcome of their pull requests~\cite{steinmacher2018almost}.
It might be hard to get them to spend more time maintaining awareness
of other developers.
Automatic awareness tools can mitigate this problem.
Prior research has proposed to automatically detect duplicates at pull request submission~\cite{li2017detecting,ren2019identifying} and identify ongoing features from forks~\cite{zhou2018identifying}.
Furthermore,
we advocate for future research on seamlessly integrating awareness tools into developers' development environments
and designing intelligent and non-intrusive notification mechanisms.
\vspace{0.5em}
\noindent \textbf{A mismatch between awareness mechanisms and actual demands.}
% passive and active
Currently,
GitHub provides developers with a wide range of mechanisms,
\eg following developers and watching projects~\cite{Dabbish2012Social},
to maintain a general awareness about project status.
% in terms of historical, current, and even upcoming events.
However,
developers can be overwhelmed by the large volume of incoming events in popular projects
({\footnotesize Section~\ref{ss:contextanalysis}: \textit{\textbf{Missing notifications}}}).
It is also impractical for developers to always maintain overall awareness of a project
due to multitasking~\cite{vasilescu2016sky} and turnover~\cite{Lin2017Developer}.
Usually,
developers need to obtain on-demand awareness around a specific task
whenever deciding to submit a pull request,
\ie gathering task-centric information to figure out people interested in the same task.
Currently,
the main mechanisms to meet this demand
are querying the issue and pull request lists and reading through the discussion history.
As mentioned in Section~\ref{ss:contextanalysis},
the support by these mechanisms is not as adequate as expected
due to information mixture
({\footnotesize Section~\ref{ss:contextanalysis}: \textbf{\textit{Overlooking linked pull requests}} and \textbf{\textit{Overlooking existing claims}}})
and other technical problems
({\footnotesize Section~\ref{ss:contextanalysis}:
\textbf{\textit{Disappointing search functionality}} and \textbf{\textit{Diversity of natural language usages}}}).
Awareness mechanisms would be most useful
if they can fulfil developers' actual demands in maintaining awareness.
\vspace{0.5em}
\noindent \textbf{A mismatch between awareness maintenance and actual information exchange.}
Maintaining awareness is
% dual.
bidirectional.
Intuitively,
it means that developers need to \textit{gather external information} to stay aware of others' activities.
But from a global perspective,
it also means developers should actively \textit{share their personal information} that can be gathered by others.
Our findings show that some developers do not announce their plans in a timely manner
({\footnotesize Section~\ref{ss:contextanalysis}: \textbf{\textit{Implementing without claiming first}}})
or link their work to the associated issue
({\footnotesize Section~\ref{ss:contextanalysis}: \textbf{\textit{Lack of links}}}).
This hinders other developers' ability to gather adequate contextual information.
Although prior work~\cite{Treude2010Awareness,Arora2016Supporting,Calefato2012Social} has extensively studied
how to help developers track work and get information,
more research attention should be paid to encouraging developers to share awareness information.
For example,
it would be interesting to investigate
whether developers' willingness to share information is affected by the characteristics of collaboration mechanisms and communication tools.
Obviously,
awareness tools are important for OSS developers to stay aware of each other.
However,
no tool or mechanism can prevent awareness breakdown entirely.
A better understanding of the importance of group awareness
and better use of available technologies can
help developers ensure that their individual contributions do not cause accidental duplicates.
\subsubsection{The paradox of duplicates}
Generally speaking,
both project integrators and contributors hope to prevent duplicate pull requests,
because duplicates can waste their time and effort,
as shown in Section~\ref{rq1}.
This is also reflected in their comments,
\eg
\textit{``it probably makes sens to just center around a single effort''},
\textit{``No need to do the same thing in two PRs''},
and \textit{``Oops! Sorry, did not mean to double up''}.
However,
when duplicates are already produced,
some potential value might be mined from them
% except for the negative impacts,
as shown in Section~\ref{beyond}.
% 在管理者看来
\vspace{0.5em}
\noindent \textbf{Redundancy vs. alternative.}
In many cases,
duplicate pull requests change nearly the same code,
which only brings unnecessary redundancy.
In some cases, however,
duplicates implemented with different approaches
provide alternative solutions,
as a developer put it:
\textit{``The pull requests are different, so maybe it is good there are two.''}
In such cases,
project integrators have a better chance of accepting a superior patch.
However,
this comes at a price.
Integrators have to invest more time to compare the details of duplicates
in order to clearly disclose the difference between them.
Ensuring that their effort is not wasted in coping with duplicates,
but maximizing the disclosure and adoption of additional value provided by each duplicate,
is a trade-off integrators should be aware of.
% 对贡献者来说
\vspace{0.5em}
\noindent \textbf{Competition vs. collaboration.}
At first sight,
authors of duplicate pull requests
compete to get their own patches accepted.
For example,
one contributor tried to persuade integrators to choose his pull request rather than another one:
\textit{``Pick me! Pick me! I was first! :)''}.
Nevertheless,
we found some cases where
the authors of duplicates worked together towards a better patch by means of mutual assessment and
negotiated incorporation,
as shown in Section~\ref{beyond}.
According to developers,
the collaboration is also an opportunity for both the authors
to learn from each other's strengths
(\textit{`` I looked at some awesome code that xxx wrote to fix this issue and it was so simple, I just did not fully understand the issue I was fixing.''}).
% and fosters a friendly collaborative environment.
Standing for their own patches,
but seeking collaboration and learning,
is a trade-off the authors of duplicates should be aware of.
\subsubsection{Decision-making in Either/Or situations}
% Previous studies on pull request acceptance suggest that
% there are many more factors that influence contribution evaluation.
In the general cases of pull request evaluation,
integrators are answering a \textit{Yes/No} question \textit{``whether to accept this pull request?''},
and the considered factors mainly relate to individual pull requests.
In contrast, when making a decision between duplicates,
integrators are answering an \textit{Either/Or} question \textit{``whether to choose this duplicate pull request or the other one?''}.
Our findings about integrators' choice between duplicates,
as presented in Section~\ref{rq3},
show that in the context of selecting between duplicates,
additional factors are considered by integrators,
and the effects of some factors are opposite to
what was observed in the context of general pull request evaluation.
\vspace{0.5em}
\noindent \textbf{Community fairness matters:}
Our findings confirm the results of previous studies that
pull requests with accurate and high-quality implementation and submitted by experienced developers
are more likely to be accepted~\cite{yu16det,gousios2015work,zou2019how,Kononenko2018Studying}.
However,
the submitter's standing and the social connection between the submitter and integrators
do not have as strong an effect on the acceptance of duplicate pull requests as observed in prior work~\cite{Tsay2014Influence,yu16det}.
Instead,
we observe that a more objective factor, \ie the arrival order of duplicates,
presents a strong effect.
Both our regression model
% (predictor \texttt{early\_arrival})
and manual observation
% (reason \textbf{First-come, First-served})
show that
duplicate pull requests arriving earlier than their counterparts are more likely to be accepted.
% This factor belongs to neither social metric nor technical metric.
Respecting the arrival order is like a default community rule
developers usually follow to manage duplicates~\cite{sun2011towards,sun2010discriminative,stackexchange-dup}.
% A similar treatment \textit{``first-in, first-out''} was also observed
% when integrators prioritize their work on pull request evaluation~\cite{gousios2015work}.
This might indicate that integrators focus more on technical metrics and objective differences
when making decisions between duplicates
in order to ensure fairness within the community~\cite{Jensen2007Role, baysal2012secret}.
% explaining rejection big challenge
\vspace{0.5em}
\noindent \textbf{Effort investment matters:}
% revision; in_line comments
While prior work on pull request acceptance has suggested that highly discussed pull requests are less likely to be accepted~\cite{Tsay2014Influence},
our model shows that
duplicate pull requests with more inline comments
% (predictor \texttt{comments\_inline})
and revisions
% (predictor \texttt{revisions})
have higher chances of acceptance.
A higher number of review comments and revisions might indicate that
the pull request is not perfect and integrators request changes to improve it.
However,
it also reflects that the pull request has been thoroughly reviewed,
and both integrators and the submitter have invested considerable effort.
If the pull request does not have fatal drawbacks and the other duplicate pull request does not provide significant benefits,
integrators should prefer the thoroughly reviewed pull request rather than the shallowly reviewed one.
After all,
the top challenge faced by integrators is time~\cite{gousios2015work},
and it is reasonable for them to avoid putting more redundant effort into the shallowly reviewed duplicates.
Moreover,
asking for more work from contributors to improve their pull requests might be difficult~\cite{gousios2015work,steinmacher2018almost}.
Thus,
choosing the thoroughly discussed duplicate pull requests of high maturity
% (\ie higher \texttt{revisions})
might be a safe decision.
}
\subsection{Suggestions for contributors}
To avoid unintentional duplicate pull requests,
contributors may follow a set of best contributing practices
when they are involved in the pull-based development model.
\vspace{0.5em}
\noindent \textbf{Adequate checking:}
{\color{hltext}
Many duplicates were produced because
contributors did not conduct adequate checking to
make sure that no one else is working on the same thing ({\footnotesize Section~\ref{ss:contextanalysis}:
\textbf{\textit{Not searching for existing work}}, \textbf{\textit{Overlooking linked pull requests}}, and \textbf{\textit{Overlooking existing claims}}}).
We recommend that
contributors should perform at least three kinds of checking:
\textit{i)} reading through the whole discussion of an issue and checking whether anyone has claimed the issue;
\textit{ii)} examining each of the pull requests linked to an issue and checking whether any of them is an ongoing work to solve the issue; and
\textit{iii)} performing searches with different keywords against open and closed pull requests and issues,
and carefully checking whether similar work already exists.
}
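Check \textit{iii)} can be made systematic by enumerating the searches up front.
The following is only a minimal sketch:
the repository name and the keyword phrasings are hypothetical,
the helper \texttt{build\_search\_queries} is our own illustration,
and it relies only on GitHub's \texttt{repo:}, \texttt{is:pr}, \texttt{is:issue}, \texttt{is:open}, and \texttt{is:closed} search qualifiers.

```python
from itertools import product


def build_search_queries(repo, keyword_variants):
    """Return one search query per (keywords, kind, state) combination."""
    kinds = ["is:pr", "is:issue"]      # search pull requests and issues
    states = ["is:open", "is:closed"]  # closed items may still be duplicates
    return [
        f"repo:{repo} {kind} {state} {keywords}"
        for keywords, kind, state in product(keyword_variants, kinds, states)
    ]


# Hypothetical repository and two hypothetical phrasings of the same problem.
queries = build_search_queries(
    "example-org/example-repo",
    ["unicode filename crash", "non-ascii path error"],
)
for query in queries:
    print(query)
```

Each printed string can be pasted into the search box of the project;
phrasing the same problem in more than one way partly compensates for the diversity of natural language usage.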
\vspace{0.5em}
\noindent \textbf{Timely completion:}
Quite a number of OSS developers contribute to a project in their spare time,
and some of them even switch between multiple tasks.
As a result,
it might be difficult for them to complete an individual task in a timely fashion.
However,
we still suggest that contributors complete each piece of work promptly and in proper order, \eg one item at a time, to shorten the duration of their local work.
This can make their work publicly visible earlier,
which can, to some extent, prevent others from submitting duplicates
({\color{hltext}
{\footnotesize Section~\ref{ss:contextanalysis}:
\textbf{\textit{Overlong local work}}
}}).
\vspace{0.5em}
\noindent \textbf{Precise context:}
Providing complete and clear textual information for submitted pull requests is helpful
for other contributors to retrieve these pull requests and acquire an accurate and comprehensive understanding of them
({\color{hltext}
{\footnotesize Section~\ref{ss:contextanalysis}: \textbf{\textit{Diversity of natural language usage}}
}}).
In addition,
if a pull request is solving a tracked issue,
adding the issue reference in the pull request description,
\textit{e.g., ``fix \#[issue\_number]'',}
% helps to aggregate related activities concerning the same issue.
{\color{hltext}
can avoid some duplicates by
increasing other developers' awareness of the ongoing work
%other developers would have a chance to find the linked pull request
({\footnotesize Section~\ref{ss:contextanalysis}:
\textbf{\textit{Lack of links}}
}).
}
\vspace{0.5em}
\noindent \textbf{Early declaration:}
{\color{hltext}
Zhou \etal~\cite{zhou2019fork} already suggested that
claiming an issue upfront is associated with a lower chance of redundant work.
We would like to emphasize again the importance of early declaration based on our findings ({\color{hltext}
{\footnotesize Section~\ref{ss:contextanalysis}:
\textbf{\textit{Implementing without claiming first}}
}}).
If contributors decide to submit a pull request to an issue,
they should first declare their intentions before coding,
instead of reporting their patches after the work is done.
Compared with late reporting,
early declaration broadcasts
contributors' intentions to the community in a timely manner and gets the attention of interested parties,
so that they can avoid accidental duplicate work }
and the core team has a chance to coordinate contributors' work,
as an integrator stated
\textit{``@xxx, btw, it is a good idea to comment on an issue when you start working on it, so we can coordinate better and avoid duplication of effort''}.
{\color{hltext}
\vspace{0.5em}
\noindent \textbf{Argue for their PRs:}
Integrators might consider various factors when making decisions between duplicates,
as shown in Section~\ref{rq3}.
Contributors should actively argue for their own duplicate pull requests
by explicitly stating the strengths of their patches,
especially if they have proposed a different approach and provided additional benefits.
Authors of duplicates can review each other's patches and discuss the differences between them
rather than waiting for an official statement from the core team.
This can provide a solid basis for integrators to make informed decisions about which duplicate should be accepted.
Moreover,
if the value of a duplicate pull request has been explicitly stated,
even if it is eventually closed,
its useful part has a higher chance to be noticed and cherry-picked by integrators,
as shown in Section~\ref{beyond}.
}
\subsection{Suggestions for core team}
The core team of an OSS project,
acting as the integrator and maintainer of the project,
is responsible for establishing contribution standards
and coordinating contributors' development.
To achieve the long-term and continuous survival of the project,
the core team may also follow some best practices.
\vspace{0.5em}
\noindent \textbf{Evident guidelines:}
{\color{hltext}
OSS projects,
especially popular projects,
usually have a contribution guideline
telling contributors how to report an issue, submit a pull request, \etc
We found that many of them warn contributors not to submit duplicate issues and pull requests.
Since
many developers said they did not check or forgot to check for existing work
({\footnotesize Section~\ref{ss:contextanalysis}: \textbf{\textit{Not searching for existing work}}, \textbf{\textit{Overlooking linked pull requests}}, and \textbf{\textit{Overlooking existing claims}}
}),
projects can make the warning more visible (\eg highlighted in bold) and
easier to follow (\eg clearly itemizing the steps that should be taken to check for existing pull requests).
}
\vspace{0.5em}
\noindent \textbf{Explaining decisions:}
Integrators must make a choice between duplicate pull requests,
which means that they have to reject someone.
Contributors whose pull requests have been rejected
might appreciate feedback and an explanation of why their work was rejected, rather than having their pull requests simply closed.
% After all,
% they have devoted time and energy to their contribution.
{\color{hltext}
However,
we observed that in nearly 50\% of our qualitative samples, decisions were made without any explanation (as shown in Table~\ref{tab:reason}).
Even worse,
we found that a terse explanation (\eg \textit{``Thanks for your PR but this fix is already merged in \#20610''}) can upset the contributor
(\textit{``'already' implies I submitted my PR later than that, rather than nearly a year earlier ;) But at least it's fixed.''}).
In that case, the integrator had to give an additional apology (\textit{``sorry, sometimes a PR falls in the cracks and a newer one gets the attention. We have improved the process in hopes to avoid this but we still have a big backlog in which these things are present.''}) to mitigate the negative effect.
Future work should examine the effectiveness of this suggestion through controlled experiments.
}
\subsection{Suggestions for design of platforms}
Online collaboration platforms such as GitHub have designed and provided numerous mechanisms and tools to support OSS development.
However,
the practical problem of duplicate contributions suggests that these platforms can be improved further.
\vspace{0.5em}
\noindent \textbf{Claim button:}
% As discussed in Section~\ref{ss:contextanalysis},
% in the pull-based development model,
% contributors first finish local work and then submit a pull request for online evaluation,
% which can be described as the \textit{behave-then-report} workflow.
% This workflow makes other contributors aware of a contributor's activities mostly after her/his work is done.
% Perhaps platforms can come up with a new mechanism
% to achieve early disclosure of contributors' intended work.
% In our opinion,
% the \textit{declare-then-behave} workflow is probably a workable solution;
% it requires contributors to first declare their intention and then conduct concrete work before finally submitting a pull request.
% This is similar to the \textit{pullrequest-then-commit} workflow
% discussed in Section~\ref{ss:contextanalysis}.
% GitHub can make it a platform-supporting workflow and provide corresponding functionalities.
{\color{hltext}
% As discussed in Section~\ref{ss:contextanalysis},
% in current behave-then-report workflow (\ie contributors first finish local work and then submit a pull request for online evaluation),
% developers are usually \hl{behindhand} aware of others' intention after the work is already finished.
In order to make it more efficient for developers to maintain awareness of each other,
we envision a new mechanism called \textit{Claim} which is described as follows.
For a GitHub issue,
each developer who is interested in it
can click the \textit{Claim} button on the issue page to claim that s/he is going to work on the issue.
The usernames of all claimers of the issue
are listed together below the \textit{Claim} button.
Every time the \textit{Claim} button is clicked,
an empty pull request is automatically generated
and linked to the claimer's username in the issue claimer list.
Moreover,
claimers have a chance to report their plans about how to fix the issue in the input box displayed when the \textit{Claim} button is clicked.
The reported plans would be used to describe the empty pull request.
Subsequently,
claimers update the empty pull request by pushing code
until they produce a complete patch.
All important updates on the empty pull request,
\eg new commits pushed,
would be displayed in the claimer list.
On the one hand,
this mechanism makes it more convenient for developers to
share their intentions and activities through just clicking a button.
On the other hand,
developers can efficiently catch and track other developers' intentions and activities
by checking the issue claimer list.
}
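To make the envisioned \textit{Claim} mechanism more concrete,
the following minimal sketch models its state with two record types.
All names (\texttt{Issue}, \texttt{PlaceholderPR}, \texttt{claim}, \texttt{push}, \texttt{claimer\_list}) are hypothetical and do not correspond to any existing GitHub API.
We additionally assume that a repeated click by the same developer returns the existing empty pull request instead of opening a second one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlaceholderPR:
    """The empty pull request generated when a developer clicks Claim."""
    author: str
    plan: str
    commits: List[str] = field(default_factory=list)


@dataclass
class Issue:
    number: int
    claims: Dict[str, PlaceholderPR] = field(default_factory=dict)

    def claim(self, username: str, plan: str = "") -> PlaceholderPR:
        """Register a claimer; the optional plan describes the empty PR."""
        # Assumption: a second click by the same user reuses the first PR.
        if username not in self.claims:
            self.claims[username] = PlaceholderPR(author=username, plan=plan)
        return self.claims[username]

    def push(self, username: str, commit_sha: str) -> None:
        """Record a new commit on the claimer's pull request."""
        self.claims[username].commits.append(commit_sha)

    def claimer_list(self) -> List[str]:
        """Entries shown below the Claim button on the issue page."""
        return [
            f"{user}: {len(pr.commits)} commit(s); plan: {pr.plan or 'n/a'}"
            for user, pr in self.claims.items()
        ]


issue = Issue(number=42)                        # hypothetical issue
issue.claim("alice", plan="patch the parser")   # hypothetical users
issue.claim("bob")
issue.push("alice", "abc123")
for entry in issue.claimer_list():
    print(entry)
```

A developer who opens the issue page sees at a glance who has claimed the issue and how far each claimer has progressed.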
\vspace{0.5em}
\noindent \textbf{Duplicate detection: }
As contributors complained, \eg
% cmt_id: 2538323
\textit{``... I wish there has been some automated method to detect pending PR per file basis. This could save lot of work duplicacy. ...''}, or
\textit{`` It's strange that GitHub isn't complaining about this, because it's an exact dup of \#5131 which was merged already''},
GitHub lacks an automatic tool for detecting duplicates.
Such a tool can help integrators detect duplicates in a timely manner
and prevent them from spending resources on the redundant effort of evaluating duplicates separately.
Therefore,
GitHub can learn from Stack Overflow and Bugzilla to recommend similar work when developers are creating pull requests by utilizing various similarity measures, \eg title and code changes.
The features discussed in Section~\ref{ss:metrics} can also be integrated to enhance the recommendation system.
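As a rough illustration of such a recommendation,
the sketch below scores a new pull request against existing ones.
It is not the detection approach of prior work:
the Jaccard overlap of title words and of changed file paths is a deliberately simple stand-in for the title and code-change similarity measures,
and the equal weighting, the threshold of 0.5, and the sample data are our own assumptions.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric words of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(pr_a, pr_b, title_weight=0.5):
    """Weighted mix of title overlap and changed-file overlap."""
    title_sim = jaccard(tokens(pr_a["title"]), tokens(pr_b["title"]))
    file_sim = jaccard(set(pr_a["files"]), set(pr_b["files"]))
    return title_weight * title_sim + (1 - title_weight) * file_sim


def recommend_similar(new_pr, existing_prs, threshold=0.5):
    """Existing pull requests scoring at least `threshold`, best first."""
    scored = [(similarity(new_pr, pr), pr["number"]) for pr in existing_prs]
    return sorted((s, n) for s, n in scored if s >= threshold)[::-1]


# Hypothetical pull requests.
new_pr = {"title": "Fix crash on empty config file",
          "files": ["src/config.py"]}
existing_prs = [
    {"number": 101, "title": "Fix crash when config file is empty",
     "files": ["src/config.py", "tests/test_config.py"]},
    {"number": 102, "title": "Add dark mode", "files": ["ui/theme.css"]},
]
candidates = recommend_similar(new_pr, existing_prs)
print(candidates)
```

Only the first existing pull request is recommended here, because it shares most title words and one changed file with the new one.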
{\color{hltext}
\vspace{0.5em}
\noindent \textbf{Reference reminder:}
Since developers might overlook linked pull requests to issues
({\color{hltext}
{\footnotesize Section~\ref{ss:contextanalysis}: \textbf{\textit{Overlooking linked pull requests}}}}),
platforms can actively remind developers of existing pull requests linked to the same issue at pull request submission time.
The goal of this functionality is similar to that of the duplicate detection tool.
However, it can be implemented in a more straightforward way.
For example,
when developers are filling in a pull request and adding a reference to the associated GitHub issue,
a pop-up box can be displayed next to the issue reference
to list the existing pull requests linked to that issue.
}
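The lookup behind such a reminder can be sketched in a few lines.
This is a minimal sketch under our own assumptions:
it only recognizes the closing keywords (\textit{close}, \textit{fix}, \textit{resolve}, and their inflections) followed by \texttt{\#} and an issue number,
and the function names and sample pull requests are hypothetical.

```python
import re
from collections import defaultdict

# Closing keywords followed by "#<number>", e.g. "Fixes #3".
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def referenced_issues(description):
    """Issue numbers referenced with a closing keyword."""
    return {int(number) for number in CLOSING_REF.findall(description)}


def build_index(pull_requests):
    """Map each issue number to the pull requests that reference it."""
    index = defaultdict(list)
    for pr in pull_requests:
        for issue in referenced_issues(pr["description"]):
            index[issue].append(pr["number"])
    return index


def remind(draft_description, index):
    """Existing pull requests linked to the issues named in a draft."""
    return {
        issue: index[issue]
        for issue in referenced_issues(draft_description)
        if index.get(issue)
    }


# Hypothetical existing pull requests.
existing = [
    {"number": 7, "description": "Fixes #3 by escaping input"},
    {"number": 9, "description": "Refactor logging"},
]
index = build_index(existing)
print(remind("This PR closes #3", index))
```

The pop-up box would display the returned pull request numbers next to the issue reference while the developer is still editing the description.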
\vspace{0.5em}
\noindent \textbf{Duplicate comparison: }
As discussed in Section~\ref{rq3},
when integrators make a choice between duplicate pull requests,
they consider several factors.
Platforms can support duplicate comparison to make the selection process more efficient.
For example,
platforms can automatically extract several features of compared duplicates,
\eg inclusion of test code and the contributor's experience,
and display these features in a comparison format to
clearly show the differences between duplicate pull requests
and speed up the selection process.
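One possible comparison format is a plain side-by-side table.
The sketch below is illustrative only:
the feature names and values are hypothetical,
and how a platform would extract them is outside the scope of the sketch.

```python
def comparison_table(pr_a, pr_b, features):
    """Render the chosen features of two duplicates side by side."""
    label_a = "#" + str(pr_a["number"])
    label_b = "#" + str(pr_b["number"])
    rows = [f"{'feature':<20}{label_a:>12}{label_b:>12}"]
    for name in features:
        rows.append(f"{name:<20}{str(pr_a[name]):>12}{str(pr_b[name]):>12}")
    return "\n".join(rows)


# Hypothetical feature values for two duplicate pull requests.
pr_a = {"number": 101, "includes_tests": True,
        "author_prior_prs": 14, "revisions": 3}
pr_b = {"number": 102, "includes_tests": False,
        "author_prior_prs": 2, "revisions": 1}
table = comparison_table(
    pr_a, pr_b, ["includes_tests", "author_prior_prs", "revisions"]
)
print(table)
```

Each row places the same feature of both duplicates next to each other, so integrators do not need to switch between two pull request pages.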
{\color{hltext}
\vspace{0.5em}
\noindent \textbf{Online incorporation: }
{\color{hltext} As presented in Section~\ref{beyond},
integrators might promote the incorporation of duplicate pull requests.}
Currently,
the typical way to incorporate a pull request $PR_{i}$ into another pull request $PR_{j}$ is as follows:
\textit{i)}
adding the head branch of $PR_{i}$ as a remote branch in the corresponding local repository of $PR_{j}$,
\textit{ii)}
fetching the remote branch to the local repository,
\textit{iii)}
cherry-picking the needed commits from, or rebasing onto, the remote branch,
and
\textit{iv)}
updating $PR_{j}$ by synchronizing the changes from the local repository to the head branch of $PR_{j}$.
% Sometimes the author of $PR_{j}$ might be asked to
% add the author name of $PR_{i}$ in the latest commit message or project changelog to give credit
% for the incorporated codes.
Sometimes developers also need to update the commit message or project changelog
to give credit for the incorporated code.
This process can be too complex for newcomers to undertake.
Moreover,
this process looks tedious for incorporating trivial changes.
GitHub can support online incorporation of duplicate pull requests.
For example,
it can allow developers to pick the needed code by clicking buttons in the UI,
and credit is given for the picked code
by automatically updating the commit message and changelog.
% Therefore,
% GitHub can support online incorporation of pull requests
% so that developers can easily pick the needed codes from the displayed diff snippets.
% Furthermore,
% the incorporated codes can be squashed into a separate commit to make it more effective for credit record.
}
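For reference, the manual steps \textit{i)} to \textit{iv)} correspond roughly to the following git commands,
which the sketch below assembles for the author of $PR_{j}$.
The remote name, URL, branch names, and commit hash are hypothetical.
We also assume that \texttt{origin} points to the repository hosting the head branch of $PR_{j}$,
and we add an explicit checkout of that branch before cherry-picking.

```python
import shlex


def incorporation_commands(remote_name, remote_url, remote_branch,
                           commits, target_branch):
    """Git commands to run in the local clone behind PR_j."""
    return [
        # i) add the repository holding PR_i's head branch as a remote
        ["git", "remote", "add", remote_name, remote_url],
        # ii) fetch that head branch into the local repository
        ["git", "fetch", remote_name, remote_branch],
        # assumption: make sure PR_j's head branch is checked out
        ["git", "checkout", target_branch],
        # iii) cherry-pick the needed commits
        ["git", "cherry-pick", *commits],
        # iv) update PR_j by pushing to its head branch
        ["git", "push", "origin", target_branch],
    ]


commands = incorporation_commands(
    remote_name="pr-i",                                  # hypothetical
    remote_url="https://example.com/alice/project.git",  # hypothetical
    remote_branch="fix-crash",
    commits=["abc123"],
    target_branch="my-fix",
)
for command in commands:
    print(shlex.join(command))
```

Cherry-picked commits keep their original author field,
so part of the credit is preserved even before the commit message or changelog is updated.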