\section{APPROACH}
\label{sec:approach}

The goal of our work is to automatically detect duplicate pull-requests.
As shown in Figure~\ref{fig:approach},
for a given pull-request, our method returns a list of candidate duplicate pull-requests
by computing and ranking the textual similarities between it and historical pull-requests.
We measure the textual similarity between pull-requests from different perspectives:
\textit{title similarity}, \textit{description similarity}, and their combination.
In the following sections, we elaborate each step in detail.
\begin{figure}[ht]
\centering
\includegraphics[width=8.5cm]{resources/approach_color.png}
\caption{Overall framework of our method}
\label{fig:approach}
\end{figure}

\subsection{Calculating Textual Similarity}
Our method adopts traditional natural language processing (NLP) techniques~\cite{Kantor1999}
to calculate the textual similarity between two pull-requests.
We mainly use the text of the title and description of each pull-request.

\textbf{Preprocessing.} We perform standard preprocessing on the extracted text,
including tokenization, stemming, and stop-word removal.
Different strategies can be applied to split a sentence into tokens,
depending on the type of data and the application domain~\cite{Runeson2007Detection}.
Some types of text are split into multiple tokens in common settings,
but we want to treat them as single tokens in the context of pull-requests.
For example, code paths and hyperlinks usually denote a single concept
and should not be divided into separate words.
We therefore use a regular-expression-based tokenizer to parse the raw text.
The following are some example regular expressions and the text they match
(a tokenizer sketch is given after the list).

\begin{itemize}
\item code path
\begin{itemize}
\item \footnotesize{\texttt{$\backslash$w+(?:$\backslash$:$\backslash$:$\backslash$w+)*}}
\item \textit{``ActionDispatch::Http::URL''}
\end{itemize}
\item pull-request number
\begin{itemize}
\item \footnotesize{\texttt{$\backslash$\#$\backslash$d+}}
\item \textit{``\#10319''}
\end{itemize}
\end{itemize}
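
For illustration, a minimal regular-expression tokenizer in Python,
built from the patterns above, could look as follows; the hyperlink
pattern and the function name are illustrative assumptions rather than
part of our implementation.

\begin{verbatim}
import re

# Special patterns are tried before the generic word pattern,
# so code paths and PR numbers survive as single tokens.
TOKEN_PATTERN = re.compile(r"""
      https?://\S+        # hyperlink (assumed pattern)
    | \#\d+               # pull-request number, e.g. #10319
    | \w+(?:::\w+)*       # code path, e.g. ActionDispatch::Http::URL,
                          # or an ordinary word
""", re.VERBOSE)

def tokenize(text):
    """Split raw pull-request text into tokens."""
    return TOKEN_PATTERN.findall(text)
\end{verbatim}

For example, \texttt{tokenize} applied to
\textit{``Fix ActionDispatch::Http::URL, closes \#10319''} yields the tokens
\textit{``Fix''}, \textit{``ActionDispatch::Http::URL''}, \textit{``closes''},
and \textit{``\#10319''}.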

After tokenization, each word is stemmed to its root form
(\eg \textit{``was''} to \textit{``be''} and \textit{``errors''} to \textit{``error''})
with the help of the Porter stemming algorithm~\cite{Porter1997An}.
Finally, common stop words (\eg \textit{``the''}, \textit{``we''}, \textit{``a''}),
which appear so frequently that they contribute little
to distinguishing different documents,
are removed.
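
A corresponding sketch of the stemming and stop-word step, assuming the
NLTK library (the text above does not prescribe a particular implementation):

\begin{verbatim}
from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def normalize(tokens):
    """Drop common stop words and stem the remaining tokens
    to their root form with the Porter algorithm."""
    return [stemmer.stem(t) for t in tokens
            if t.lower() not in stop_words]
\end{verbatim}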

\textbf{Transformation.}
We then transform the preprocessed text into a multi-dimensional vector
which is computable in the Vector Space Model (VSM).
A text is represented as
$TextVec_{i} = (w_{i,1}, w_{i,2}, ..., w_{i,v})$.
Each dimension of the vector corresponds to a unique word in the
corpus built from all the texts.
The value of $w_{i,k}$,
the weight of the \textit{k}-th item of the vector of the \textit{i}-th text,
is computed by the TF-IDF model~\cite{Sun2010A}:
\begin{equation}
w_{i,k} = tf_{i,k} \times idf_{k}
\label{equ:tw}
\end{equation}
In the above formula,
$tf_{i,k}$ denotes the \textit{term frequency},
i.e., the frequency of the \textit{k}-th term in the \textit{i}-th text,
and $idf_{k}$ denotes the \textit{inverse document frequency},
which measures the distinguishing power of a term across documents.
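
For concreteness, the following Python sketch computes the $w_{i,k}$ weights
for a single text. The idf formulation $\log(N / df_{k})$ is a standard choice
that we assume here for illustration, and the helper names
(\texttt{tfidf\_vector}, \texttt{corpus\_df}) are illustrative, not part of our
implementation.

\begin{verbatim}
import math
from collections import Counter

def tfidf_vector(tokens, corpus_df, n_docs):
    """Compute w_{i,k} = tf_{i,k} * idf_k for one preprocessed text.

    tokens:    list of tokens of the i-th text
    corpus_df: dict mapping each term to the number of texts
               containing it (document frequency)
    n_docs:    total number of texts in the corpus
    """
    tf = Counter(tokens)  # term frequency tf_{i,k}
    # Assumed idf variant: idf_k = log(N / df_k); the exact
    # formulation is not spelled out in the text above.
    return {term: count * math.log(n_docs / corpus_df[term])
            for term, count in tf.items()}
\end{verbatim}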

\textbf{Similarity.}
After transforming texts into vectors,
we compute the similarity between a pair of texts
using the \textit{Cosine} measure~\cite{Kantor1999},
which is calculated by the following formula:
\begin{equation}
\begin{aligned}
Sim(i,j) &= \frac{TextVec_{i} \cdot TextVec_{j}}
                 {\vert TextVec_{i} \vert \vert TextVec_{j} \vert} \\
         &= \frac{\sum_{m=1}^{v}(w_{i,m} \times w_{j,m})}
                 {\sqrt{\sum_{m=1}^{v} w_{i,m}^{2}} \sqrt{\sum_{m=1}^{v} w_{j,m}^{2}}}
\label{equ:sim_cosine}
\end{aligned}
\end{equation}
By applying this measure,
we can obtain two kinds of similarities between two pull-requests:
$Sim_{title}(i,j)$ and $Sim_{desc}(i,j)$.
$Sim_{title}(i,j)$ measures the similarity between their titles,
while $Sim_{desc}(i,j)$ measures the similarity between their descriptions.
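
The cosine measure can be sketched in a few lines of Python over the sparse
\texttt{\{term: weight\}} vectors produced above (again, the function name is
illustrative):

\begin{verbatim}
import math

def cosine_similarity(vec_i, vec_j):
    """Cosine measure between two sparse TF-IDF vectors,
    each represented as a {term: weight} dict."""
    dot = sum(w * vec_j.get(term, 0.0) for term, w in vec_i.items())
    norm_i = math.sqrt(sum(w * w for w in vec_i.values()))
    norm_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # guard against empty texts
    return dot / (norm_i * norm_j)
\end{verbatim}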

\subsection{Ranking Candidate List}
After $Sim_{title}$ and $Sim_{desc}$ have been calculated
between pull-request pairs,
we can retrieve the target pull-requests according to
these two kinds of similarities.
To produce the candidate list of top-k pull-requests for a given pull-request,
we use the combined similarity~\cite{Wang2008}
computed by the following formula.
\begin{equation}
SimText(i,j) = Sim_{title}(i,j) + Sim_{desc}(i,j)
\label{equ:sim_text}
\end{equation}
In the formula, we use a straightforward heuristic,
summing $Sim_{title}(i,j)$ and $Sim_{desc}(i,j)$,
to get the final textual similarity;
this unweighted combination is equivalent to the arithmetic average
for ranking purposes and is the most widely used combination function.
Finally, the top-k pull-requests with the highest $SimText$ scores
for the given pull-request are returned in the candidate list.
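
Putting the pieces together, a minimal ranking sketch in Python might look as
follows; the data layout of \texttt{history} and the cutoff \texttt{k} are
illustrative assumptions, and \texttt{cosine\_similarity} is the helper
sketched above.

\begin{verbatim}
import heapq

def rank_candidates(title_vec, desc_vec, history, k=20):
    """Return the top-k historical pull-requests ranked by
    SimText = Sim_title + Sim_desc.

    history: iterable of (pr_id, title_vec, desc_vec) tuples
             for previously submitted pull-requests
    k:       size of the candidate list (illustrative value)
    """
    scored = ((cosine_similarity(title_vec, t) +
               cosine_similarity(desc_vec, d), pr_id)
              for pr_id, t, d in history)
    # nlargest keeps the k pairs with the highest combined score
    return [pr_id for score, pr_id in heapq.nlargest(k, scored)]
\end{verbatim}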