\section{APPROACH}
\label{sec:approach}
The goal of our work is to automatically detect duplicate pull-requests.
As shown in Figure~\ref{fig:approach},
for a given pull-request, our method returns a list of candidate duplicate pull-requests
by computing and ranking the textual similarities between it and the historical pull-requests.
We determine the textual similarity between pull-requests from different perspectives:
\textit{title similarity}, \textit{description similarity} and their combination.
In the following sections, we elaborate on each step in detail.
\begin{figure}[ht]
\centering
\includegraphics[width=8.5cm]{resources/approach_color.png}
\caption{Overall framework of our method}
\label{fig:approach}
\end{figure}
\subsection{Calculating Textual Similarity}
Our method adopts traditional natural language processing (NLP) techniques~\cite{Kantor1999}
to calculate the textual similarity between two pull-requests.
We mainly use the textual information from the titles and descriptions of pull-requests.
\textbf{Preprocessing.} We perform standard preprocessing on the extracted text,
including tokenization, stemming and stop-word removal.
Different strategies can be applied to split a sentence into tokens,
depending on the type of data and the application domain~\cite{Runeson2007Detection}.
Some types of text are split into multiple tokens by common tokenizers,
but we want to treat them as single tokens in the context of pull-requests.
For example, code paths and hyperlinks usually denote a single concept
and should not be divided into separate words.
Therefore, we use a regular-expression tokenizer to parse the raw text.
The following are some example regular expressions and the text they match;
a minimal sketch of such a tokenizer is given after the list.
\begin{itemize}
\item code path
\begin{itemize}
\item \footnotesize{\texttt{$\backslash$w+(?:$\backslash$:$\backslash$:$\backslash$w+)*}}
\item \textit{``ActionDispatch::Http::URL''}
\end{itemize}
\item pull-request number
\begin{itemize}
\item \footnotesize{\texttt{$\backslash$\#$\backslash$d+}}
% \item $thanks?|thxs?|:(\backslash w+)?heart:$
\item \textit{``\#10319''}
\end{itemize}
\end{itemize}
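For illustration, a minimal Python sketch of such a regular-expression tokenizer could look as follows; the combined pattern and the function name are illustrative and are not necessarily identical to the actual implementation.
\begin{verbatim}
import re

# Illustrative combined pattern: code paths (e.g. "ActionDispatch::Http::URL"),
# pull-request numbers (e.g. "#10319"), and plain words.
TOKEN_PATTERN = re.compile(r"\w+(?:::\w+)+|#\d+|\w+")

def tokenize(text):
    """Split raw pull-request text into tokens, keeping code paths
    and pull-request numbers as single tokens."""
    return TOKEN_PATTERN.findall(text)

# tokenize("Fix ActionDispatch::Http::URL as reported in #10319")
# -> ['Fix', 'ActionDispatch::Http::URL', 'as', 'reported', 'in', '#10319']
\end{verbatim}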
After tokenization, each word is stemmed to its root form
(\eg \textit{``errors''} to \textit{``error''} and \textit{``running''} to \textit{``run''})
with the help of the Porter stemming algorithm~\cite{Porter1997An}.
Finally, common stop words (\eg \textit{``the'', ``we'', ``a''}),
which appear so frequently that they contribute very little
to distinguishing different documents,
are removed.
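As a concrete sketch of the whole preprocessing step, the following Python snippet combines tokenization, stemming and stop-word removal; it assumes the Porter stemmer implementation shipped with NLTK and uses a deliberately small, illustrative stop-word set, so it is an example under these assumptions rather than the exact pipeline of our tool.
\begin{verbatim}
import re
from nltk.stem.porter import PorterStemmer

# Same illustrative token pattern as in the previous sketch.
TOKEN_PATTERN = re.compile(r"\w+(?:::\w+)+|#\d+|\w+")
# Small illustrative stop-word set; a standard list is used in practice.
STOP_WORDS = {"the", "we", "a", "an", "is", "of", "to", "in"}

stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, lower-case, stem and remove stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    stems = [stemmer.stem(tok) for tok in tokens]
    return [s for s in stems if s not in STOP_WORDS]
\end{verbatim}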
\textbf{Transformation.}
We then transform the preprocessed text into a multi-dimensional vector
in the Vector Space Model (VSM).
A text is represented as
$TextVec_{i} = (w_{i,1},w_{i,2}, ..., w_{i,v})$,
where each dimension of the vector corresponds to a unique word in the
corpus built from all the texts.
The value of $w_{i,k}$,
the weight of the \textit{k}-th item of the vector of the \textit{i}-th text,
is computed by the TF-IDF model~\cite{Sun2010A}:
\begin{equation}
w_{i,k} = tf_{i,k} \times idf_{k}
\label{equ:tw}
\end{equation}
In the above formula,
$tf_{i,k}$ denotes the \textit{term frequency},
which is the frequency of the \textit{k}-th term in the \textit{i}-th text,
and $idf_{k}$ denotes the \textit{inverse document frequency},
which measures the distinguishing power of the term.
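Assuming the common instantiation $idf_{k} = \log(N / df_{k})$, where $N$ is the number of texts and $df_{k}$ is the number of texts containing the \textit{k}-th term, Equation~\ref{equ:tw} can be sketched in Python as follows (function and variable names are illustrative):
\begin{verbatim}
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: a list of token lists, one per preprocessed text.
    Returns one sparse {term: weight} vector per text, with
    weight = tf * idf and idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))      # document frequency of each term
    vectors = []
    for doc in docs:
        tf = Counter(doc)        # raw term frequency in this text
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
\end{verbatim}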
\textbf{Similarity.}
After transforming the texts into vectors,
we compute the similarity between a pair of texts
using the \textit{Cosine} measure~\cite{Kantor1999},
which is calculated by the following formula:
\begin{equation}
\begin{aligned}
Sim(i,j) &= \frac{TextVec_{i} \cdot TextVec_{j}}
{\vert TextVec_{i} \vert \, \vert TextVec_{j} \vert} \\
&= \frac{\sum_{m=1}^{v}(w_{i,m}\times w_{j,m})}
{ \sqrt{\sum_{m=1}^{v} w_{i,m}^{2}} \sqrt{\sum_{m=1}^{v} w_{j,m}^{2}}}
\label{equ:sim_cosine}
\end{aligned}
\end{equation}
By applying this measure,
we obtain two kinds of similarities between two pull-requests:
$Sim_{title}(i,j)$ and $Sim_{desc}(i,j)$.
$Sim_{title}(i,j)$ measures the similarity between their titles,
while $Sim_{desc}(i,j)$ measures the similarity between their descriptions.
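A straightforward sketch of this computation over the sparse vectors produced above (assuming the dictionary representation from the previous sketch) is:
\begin{verbatim}
import math

def cosine_similarity(vec_i, vec_j):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * vec_j.get(t, 0.0) for t, w in vec_i.items())
    norm_i = math.sqrt(sum(w * w for w in vec_i.values()))
    norm_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)
\end{verbatim}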
\subsection{Ranking Candidate List}
After $Sim_{title}$ and $Sim_{desc}$
between pull-request pairs are calculated,
we are able to retrieve the target pull-requests according to
these two kinds of similarities.
To produce the candidate list of the top-k pull-requests for a given pull-request,
we use the combined similarity~\cite{Wang2008}
computed by the following formula:
\begin{equation}
SimText(i,j) = \frac{Sim_{title}(i,j) + Sim_{desc}(i,j)}{2}
\label{equ:sim_text}
\end{equation}
In this formula, we use a straightforward heuristic,
the arithmetic average of $Sim_{title}(i,j)$ and $Sim_{desc}(i,j)$,
to obtain the final textual similarity;
the arithmetic average is the most widely used combination function.
Finally, the top-k pull-requests with the highest $SimText$ scores with respect to the given pull-request
are returned as the candidate list.
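The ranking step can be sketched as follows; the dictionary keys \texttt{title\_vec} and \texttt{desc\_vec} and the default value of $k$ are hypothetical placeholders, and \texttt{cosine\_similarity} refers to the sketch given earlier.
\begin{verbatim}
def rank_candidates(target, history, k=20):
    """Return the top-k historical pull-requests ranked by the
    combined (averaged) title and description similarity."""
    scored = []
    for pr in history:
        sim_title = cosine_similarity(target["title_vec"], pr["title_vec"])
        sim_desc = cosine_similarity(target["desc_vec"], pr["desc_vec"])
        scored.append(((sim_title + sim_desc) / 2.0, pr))
    # Sort by score only, so ties never trigger comparison of the dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pr for _, pr in scored[:k]]
\end{verbatim}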