StarLee 2016-09-01 20:35:31 +08:00
commit cb01ec2ffb
7 changed files with 2374 additions and 0 deletions

12
.gitignore vendored Normal file

@@ -0,0 +1,12 @@
*.tmp
~*
*.tmp
~*
*.log
*.aux
*.pdf
*.gz*
*.sty
*.eps
*.bib
.cls

BIN
Internetware2016-reg.doc Normal file

Binary file not shown.

469
camera-ready.tex Normal file

@@ -0,0 +1,469 @@
\documentclass{sig-alternate-05-2015}
\usepackage{enumitem}
\begin{document}
\CopyrightYear{2016}
\setcopyright{acmcopyright}
\conferenceinfo{Internetware '16,}{September 18 2016, Beijing, China}
\isbn{978-1-4503-4829-4/16/09}\acmPrice{\$15.00}
\doi{http://dx.doi.org/10.1145/2993717.2993723}
\title{Query Reformulation by Leveraging Crowd Wisdom for Scenario-based Software Search}
\numberofauthors{5}
\author{
% 1st. author
\alignauthor
Zhixing Li\\
\affaddr{National University of Defense Technology}\\
\affaddr{Changsha, China, 410073}\\
\email{starleelzx@163.com}
% 2nd. author
\alignauthor
Tao Wang\\
\affaddr{National University of Defense Technology}\\
\affaddr{Changsha, China, 410073}\\
\email{starleelzx@163.com}
% 3rd. author
\alignauthor
Yang Zhang\\
\affaddr{National University of Defense Technology}\\
\affaddr{Changsha, China, 410073}\\
\email{starleelzx@163.com}
\and
% 4th. author
\alignauthor
Yun Zhan\\
\affaddr{National University of Defense Technology}\\
\affaddr{Changsha, China, 410073}\\
\email{cloud\_zhan@163.com}
% 5th. author
\alignauthor
Gang Yin\\
\affaddr{National University of Defense Technology}\\
\affaddr{Changsha, China, 410073}\\
\email{starleelzx@163.com}
}
\maketitle
\begin{abstract}
The Internet-scale production of open source software (OSS) in various communities is generating abundant reusable resources for software developers. However, retrieving and reusing the desired, mature software from a huge number of candidates is a great challenge: there are usually big gaps between the user application contexts (which are often used as queries) and the OSS keywords (which are often used to match the queries). In this paper, we define the scenario-based query problem for OSS retrieval and propose a novel approach that reformulates the raw query by leveraging the crowd wisdom of millions of developers to improve the retrieval results. We build a software-specific domain lexical database based on the knowledge in open source communities, with which we expand and optimize the input queries. The experimental results show that our approach reformulates the initial query effectively and significantly outperforms existing search engines at finding mature software.
\end{abstract}
\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10010520.10010553.10010562</concept_id>
<concept_desc>Computer systems organization~Embedded systems</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10010520.10010575.10010755</concept_id>
<concept_desc>Computer systems organization~Redundancy</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10010520.10010553.10010554</concept_id>
<concept_desc>Computer systems organization~Robotics</concept_desc>
<concept_significance>100</concept_significance>
</concept>
<concept>
<concept_id>10003033.10003083.10003095</concept_id>
<concept_desc>Networks~Network reliability</concept_desc>
<concept_significance>100</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[500]{Computer systems organization~Embedded systems}
\ccsdesc[300]{Computer systems organization~Redundancy}
\ccsdesc{Computer systems organization~Robotics}
\ccsdesc[100]{Networks~Network reliability}
%
% End generated code
%
%
% Use this command to print the description
%
\printccsdesc
% We no longer use \terms command
%\terms{Theory}
\keywords{software retrieval; crowd wisdom; query reformulation.}
\section{Introduction}
Software reuse plays a very important role in improving the quality and efficiency of software development. With the rapid growth of the open source movement, huge amounts of open source software have been published on the Internet [1]. For example, there are more than 460 thousand projects on SourceForge and more than 30 million repositories on GitHub, and the number of projects in these communities keeps growing dramatically every day. On the one hand, this provides abundant reusable resources [2],[3]; on the other hand, it makes locating the desired project among so many candidates a great challenge.
To help developers perform such tasks, many project hosting sites, such as SourceForge (sourceforge.net) and GitHub (github.com), provide search services for open source software, and users can launch queries against the software datasets these sites have indexed. General search engines, such as Google and Bing, are alternative choices because of their powerful query processing abilities.
However, neither of them fits the scenario in which users are aware only of a functional requirement or an application context, especially novices who lack development experience and programming skills, or experienced developers who are stepping into a new domain. For example, they may search for ``android database'' when they actually mean to persist data for an Android app, or ``python orm'' when they are programming in Python and want an ORM engine to replace SQL statements. We call this kind of query a Scenario-based Query. Scenario-based queries are usually short and widely used. Project hosting sites usually match queries against the text contained in software metadata such as the title and description, but this strategy cannot match users' intent perfectly [4]. Results returned by a general search engine cover a wide range of resources and usually require additional clicks and time to filter out worthless information [5].
To solve the problem of Scenario-based Query, we introduce a novel method that takes advantage of crowd wisdom to reformulate the initial query. Approaches to query reformulation fall into two types: global methods and local methods [4], [13]. Global methods work fully automatically: they find new terms that are related to the query terms, independently of the results returned for the query. Local methods make use of the documents that initially appear to match the query, and usually rely on user feedback or pseudo feedback to mark the top documents as relevant or not relevant. The marked documents are then used to adjust the initial query. However, relevance feedback has been little used in web search, and most users tend to perform their search without any further interaction [13],[25].
In this paper, we implement the global approach by using domain knowledge obtained from a lexical database of software development, which we constructed from the crowd wisdom of millions of developers on StackOverflow (stackoverflow.com). We first crawl all the tags that users have created collaboratively on StackOverflow and then build the domain knowledge from those tags, a process in which tag attributes such as count and co-occurrence play an important role. Given a query, after standard preprocessing, our method performs synonym substitution on each term in the initial query; for example, ``db'' is transformed into ``database''. Next, the query is expanded with the related terms obtained from the lexical database. Finally, the expanded queries are refined by a ranking model and used to search the project dataset. Furthermore, we conducted an empirical evaluation using 14 search scenarios with 35 volunteer developers. We combine the measures Precision at $k$ items and Mean Average Precision, and use MAP@10 to measure the relevance performance of our method and of other search services [13]. We also conduct a user study to assess how well our method helps users find mature software.
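To make the evaluation measure concrete, the following is a sketch of one common formulation of MAP@$k$. The symbols $\mathit{rel}_q(i)$, $R_q$ and $Q$ are notation assumed here for illustration only, and other normalizations of average precision exist. For a query $q$, let $\mathit{rel}_q(i)\in\{0,1\}$ indicate whether the result at rank $i$ is relevant, and let $R_q=\sum_{i=1}^{k}\mathit{rel}_q(i)$ be the number of relevant results among the top $k$. Then
% Assumed notation: rel_q(i), R_q and Q are illustrative and not defined elsewhere in the paper.
\begin{displaymath}
P_q@i = \frac{1}{i}\sum_{j=1}^{i}\mathit{rel}_q(j),
\end{displaymath}
\begin{displaymath}
AP_q@k = \frac{1}{R_q}\sum_{i=1}^{k} P_q@i \cdot \mathit{rel}_q(i),
\end{displaymath}
\begin{displaymath}
MAP@k = \frac{1}{|Q|}\sum_{q\in Q} AP_q@k,
\end{displaymath}
where $AP_q@k$ is taken to be $0$ when $R_q=0$ and $Q$ is the set of evaluated queries. As a small worked example under this formulation, if the top five results of a query are judged (relevant, not relevant, relevant, not relevant, not relevant), then $R_q=2$ and $AP_q@5=\frac{1}{2}\left(\frac{1}{1}+\frac{2}{3}\right)=\frac{5}{6}\approx 0.83$.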
In summary, our main contributions in this paper include:
\begin{enumerate}[fullwidth,itemindent=1em,label=\arabic*)]\setlength{\itemsep}{0pt}
\item We build a software-specific lexical database by leveraging crowd wisdom and effectively analyze the domain knowledge in StackOverflow.
\item We reformulate queries with the software-specific lexical database to capture the user's real query intent; this approach performs well on scenario-based queries.
\item Extensive experimental results show that our method can benefit software development by helping users find mature software more efficiently.
\end{enumerate}
The rest of the paper is organized as follows: Section 2 briefly reviews related work and Section 3 explains related concepts. Section 4 describes our method in detail through a prototype design. Section 5 presents our empirical experiments and the evaluation of our method. Section 6 discusses threats to the validity of our method. Finally, Section 7 concludes the paper and outlines future work.
\section{Related Work}
After reviewing a great deal of literature, we found that most studies in the area of open source software search focus on code search [16],[17],[18], while few researchers study the search of software project entities. Tegawende F. Bissyande proposed an integrated search engine, Orion [6], which focuses on searching for project entities and provides a uniform interface with a declarative query language. The authors conclude that their system can help users find relevant projects faster and more accurately than traditional search engines. However, the restricted search language places an additional burden on users when they express their search intent, which reduces usability. Linstead [7] developed Sourcerer, which takes advantage not only of the textual aspects of software, but also of its structural aspects and of any relevant metadata. This system processes queries at the code level and is not suited to higher-level search, but it inspires us to view software projects from different perspectives.
Studies [13],[19] list several techniques for reformulating a query, which can be classified into two types: global methods and local methods. Global methods are independent of the results returned for the query. Query expansion via a thesaurus is a widely used global method, in which the thesaurus can be generated automatically or manually. In this paper, we use an automatically generated thesaurus. Local methods refine a query according to the documents that are retrieved in the first round to match the query. Relevance feedback has been shown to be an effective local method for improving the relevance of results but, as Manning stated, very few users use the relevance feedback option on the web [13]. Pseudo relevance feedback is another local method: it treats the top $k$ ranked items in the result list as relevant and then proceeds as relevance feedback does. The problem is that it may sometimes result in query drift. Reformulating queries via an automatically generated thesaurus is therefore more practical for software search.
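As a point of reference for the local methods discussed above, relevance feedback is classically formalized in the vector space model by the Rocchio update, sketched below. The symbols are standard textbook notation assumed here for illustration and are not part of our method: $\vec{q}_0$ is the initial query vector, $D_r$ and $D_{nr}$ are the sets of documents marked relevant and not relevant, and $\alpha$, $\beta$, $\gamma$ are nonnegative weights.
% Assumed notation: q_0, D_r, D_nr, alpha, beta and gamma are illustrative symbols.
\begin{displaymath}
\vec{q}_m = \alpha\,\vec{q}_0
+ \frac{\beta}{|D_r|}\sum_{\vec{d}_j\in D_r}\vec{d}_j
- \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j\in D_{nr}}\vec{d}_j
\end{displaymath}
Pseudo relevance feedback corresponds to taking $D_r$ to be the top $k$ retrieved documents, typically with $\gamma=0$. Any irrelevant documents among those top $k$ then contribute to $\vec{q}_m$ as if they were relevant, which is one way to see how query drift arises.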
Automatic query expansion (AQE) has been widely used to overcome the inaccuracy of information retrieval systems. Carpineto [8] presents a unified view of a large number of approaches to AQE that leverage various data sources and employ very different principles and techniques. Gao [20] represents search logs as a labeled directed graph and expands queries with path-constrained random walks, in which the probability of selecting an expansion term for a query term is computed by a learned combination of constrained random walks on the graph. Lu [9] first identifies the part of speech of each term in the initial query and then finds the synonyms of each term in WordNet [12]. Like [9], we also use a corpus to expand the initial query; the difference is that our corpus is specific to the software domain and is built on open source software consumption communities, inspired by Yin and Wang [3], who stated that data from software consumption communities is very important for the evaluation of open source software.
There are many available approaches to ranking software. OpenBRR [22] makes use of source code, documentation and other data from the software development process to do this job. This method considers only the software itself and ignores its practical application. SourceForge and OpenHub take advantage of the popularity of a piece of software to rank it. The limitation of these methods is that their results sometimes deviate from the actual situation, because all the user feedback they adopt comes from their own platforms. Fan [24] and Zhang [26] went further and used user feedback from consumption communities to assess and rank software. We share their view and think it is more reasonable to make the best of crowd wisdom.
\subsection{Math Equations}
You may want to display math equations in three distinct styles:
inline, numbered or non-numbered display. Each of
the three is discussed in the next sections.
\subsubsection{Inline (In-text) Equations}
A formula that appears in the running text is called an
inline or in-text formula. It is produced by the
\textbf{math} environment, which can be
invoked with the usual \texttt{{\char'134}begin. . .{\char'134}end}
construction or with the short form \texttt{\$. . .\$}. You
can use any of the symbols and structures,
from $\alpha$ to $\omega$, available in
\LaTeX\cite{Lamport:LaTeX}; this section will simply show a
few examples of in-text equations in context. Notice how
this equation: \begin{math}\lim_{n\rightarrow \infty}x=0\end{math},
set here in in-line math style, looks slightly different when
set in display style. (See next section).
\subsubsection{Display Equations}
A numbered display equation -- one set off by vertical space
from the text and centered horizontally -- is produced
by the \textbf{equation} environment. An unnumbered display
equation is produced by the \textbf{displaymath} environment.
Again, in either environment, you can use any of the symbols
and structures available in \LaTeX; this section will just
give a couple of examples of display equations in context.
First, consider the equation, shown as an inline equation above:
\begin{equation}\lim_{n\rightarrow \infty}x=0\end{equation}
Notice how it is formatted somewhat differently in
the \textbf{displaymath}
environment. Now, we'll enter an unnumbered equation:
\begin{displaymath}\sum_{i=0}^{\infty} x + 1\end{displaymath}
and follow it with another numbered equation:
\begin{equation}\sum_{i=0}^{\infty}x_i=\int_{0}^{\pi+2} f\end{equation}
just to demonstrate \LaTeX's able handling of numbering.
\subsection{Citations}
Citations to articles \cite{bowman:reasoning,
clark:pct, braams:babel, herlihy:methodology},
conference proceedings \cite{clark:pct} or
books \cite{salas:calculus, Lamport:LaTeX} listed
in the Bibliography section of your
article will occur throughout the text of your article.
You should use BibTeX to automatically produce this bibliography;
you simply need to insert one of several citation commands with
a key of the item cited in the proper location in
the \texttt{.tex} file \cite{Lamport:LaTeX}.
The key is a short reference you invent to uniquely
identify each work; in this sample document, the key is
the first author's surname and a
word from the title. This identifying key is included
with each item in the \texttt{.bib} file for your article.
The details of the construction of the \texttt{.bib} file
are beyond the scope of this sample document, but more
information can be found in the \textit{Author's Guide},
and exhaustive details in the \textit{\LaTeX\ User's
Guide}\cite{Lamport:LaTeX}.
This article shows only the plainest form
of the citation command, using \texttt{{\char'134}cite}.
This is what is stipulated in the SIGS style specifications.
No other citation format is endorsed or supported.
\subsection{Tables}
Because tables cannot be split across pages, the best
placement for them is typically the top of the page
nearest their initial cite. To
ensure this proper ``floating'' placement of tables, use the
environment \textbf{table} to enclose the table's contents and
the table caption. The contents of the table itself must go
in the \textbf{tabular} environment, to
be aligned properly in rows and columns, with the desired
horizontal and vertical rules. Again, detailed instructions
on \textbf{tabular} material
are found in the \textit{\LaTeX\ User's Guide}.
Immediately following this sentence is the point at which
Table 1 is included in the input file; compare the
placement of the table here with the table in the printed
dvi output of this document.
\begin{table}
\centering
\caption{Frequency of Special Characters}
\begin{tabular}{|c|c|l|} \hline
Non-English or Math&Frequency&Comments\\ \hline
\O & 1 in 1,000& For Swedish names\\ \hline
$\pi$ & 1 in 5& Common in math\\ \hline
\$ & 4 in 5 & Used in business\\ \hline
$\Psi^2_1$ & 1 in 40,000& Unexplained usage\\
\hline\end{tabular}
\end{table}
To set a wider table, which takes up the whole width of
the page's live area, use the environment
\textbf{table*} to enclose the table's contents and
the table caption. As with a single-column table, this wide
table will ``float'' to a location deemed more desirable.
Immediately following this sentence is the point at which
Table 2 is included in the input file; again, it is
instructive to compare the placement of the
table here with the table in the printed dvi
output of this document.
\begin{table*}
\centering
\caption{Some Typical Commands}
\begin{tabular}{|c|c|l|} \hline
Command&A Number&Comments\\ \hline
\texttt{{\char'134}alignauthor} & 100& Author alignment\\ \hline
\texttt{{\char'134}numberofauthors}& 200& Author enumeration\\ \hline
\texttt{{\char'134}table}& 300 & For tables\\ \hline
\texttt{{\char'134}table*}& 400& For wider tables\\ \hline\end{tabular}
\end{table*}
% end the environment with {table*}, NOTE not {table}!
\subsection{Figures}
Like tables, figures cannot be split across pages; the
best placement for them
is typically the top or the bottom of the page nearest
their initial cite. To ensure this proper ``floating'' placement
of figures, use the environment
\textbf{figure} to enclose the figure and its caption.
This sample document contains examples of \textbf{.eps} files to be
displayable with \LaTeX. If you work with pdf\LaTeX, use files in the
\textbf{.pdf} format. Note that most modern \TeX\ systems will convert
\textbf{.eps} to \textbf{.pdf} for you on the fly. More details on
each of these are found in the \textit{Author's Guide}.
\begin{figure}
\centering
\includegraphics{fly}
\caption{A sample black and white graphic.}
\end{figure}
\begin{figure}
\centering
\includegraphics[height=1in, width=1in]{fly}
\caption{A sample black and white graphic
that has been resized with the \texttt{includegraphics} command.}
\end{figure}
As was the case with tables, you may want a figure
that spans two columns. To do this, and still to
ensure proper ``floating'' placement of figures, use the environment
\textbf{figure*} to enclose the figure and its caption.
Don't forget to end the environment with
\textbf{figure*}, not \textbf{figure}!
\begin{figure*}
\centering
\includegraphics{flies}
\caption{A sample black and white graphic
that needs to span two columns of text.}
\end{figure*}
\begin{figure}
\centering
\includegraphics[height=1in, width=1in]{rosette}
\caption{A sample black and white graphic that has
been resized with the \texttt{includegraphics} command.}
\vskip -6pt
\end{figure}
\subsection{Theorem-like Constructs}
Other common constructs that may occur in your article are
the forms for logical constructs like theorems, axioms,
corollaries and proofs. There are
two forms, one produced by the
command \texttt{{\char'134}newtheorem} and the
other by the command \texttt{{\char'134}newdef}; perhaps
the clearest and easiest way to distinguish them is
to compare the two in the output of this sample document:
This uses the \textbf{theorem} environment, created by
the\linebreak\texttt{{\char'134}newtheorem} command:
\newtheorem{theorem}{Theorem}
\begin{theorem}
Let $f$ be continuous on $[a,b]$. If $G$ is
an antiderivative for $f$ on $[a,b]$, then
\begin{displaymath}\int^b_af(t)dt = G(b) - G(a).\end{displaymath}
\end{theorem}
The other uses the \textbf{definition} environment, created
by the \texttt{{\char'134}newdef} command:
\newdef{definition}{Definition}
\begin{definition}
If $z$ is irrational, then by $e^z$ we mean the
unique number which has
logarithm $z$: \begin{displaymath}{\log e^z = z}\end{displaymath}
\end{definition}
Two lists of constructs that use one of these
forms are given in the
\textit{Author's Guidelines}.
There is one other similar construct environment, which is
already set up
for you; i.e. you must \textit{not} use
a \texttt{{\char'134}newdef} command to
create it: the \textbf{proof} environment. Here
is an example of its use:
\begin{proof}
Suppose on the contrary there exists a real number $L$ such that
\begin{displaymath}
\lim_{x\rightarrow c} \frac{f(x)}{g(x)} = L.
\end{displaymath}
Then
\begin{displaymath}
l=\lim_{x\rightarrow c} f(x)
= \lim_{x\rightarrow c}
\left[ g(x) \cdot \frac{f(x)}{g(x)} \right ]
= \lim_{x\rightarrow c} g(x) \cdot \lim_{x\rightarrow c}
\frac{f(x)}{g(x)} = 0\cdot L = 0,
\end{displaymath}
which contradicts our assumption that $l\neq 0$.
\end{proof}
Complete rules about using these environments and using the
two different creation commands are in the
\textit{Author's Guide}; please consult it for more
detailed instructions. If you need to use another construct,
not listed therein, which you want to have the same
formatting as the Theorem
or the Definition\cite{salas:calculus} shown above,
use the \texttt{{\char'134}newtheorem} or the
\texttt{{\char'134}newdef} command,
respectively, to create it.
\subsection*{A {\secit Caveat} for the \TeX\ Expert}
Because you have just been given permission to
use the \texttt{{\char'134}newdef} command to create a
new form, you might think you can
use \TeX's \texttt{{\char'134}def} to create a
new command: \textit{Please refrain from doing this!}
Remember that your \LaTeX\ source code is primarily intended
to create camera-ready copy, but may be converted
to other forms -- e.g. HTML. If you inadvertently omit
some or all of the \texttt{{\char'134}def}s, recompilation will
be, to say the least, problematic.
\section{Conclusions}
This paragraph will end the body of this sample document.
Remember that you might still have Acknowledgments or
Appendices; brief samples of these
follow. There is still the Bibliography to deal with; and
we will make a disclaimer about that here: with the exception
of the reference to the \LaTeX\ book, the citations in
this paper are to articles which have nothing to
do with the present subject and are used as
examples only.
%\end{document} % This is where a 'short' article might terminate
%ACKNOWLEDGMENTS are optional
\section{Acknowledgments}
This section is optional; it is a location for you
to acknowledge grants, funding, editing assistance and
what have you. In the present case, for example, the
authors would like to thank Gerald Murray of ACM for
his help in codifying this \textit{Author's Guide}
and the \textbf{.cls} and \textbf{.tex} files that it describes.
%
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{sigproc} % sigproc.bib is the name of the Bibliography in this case
% You must have a proper ".bib" file
% and remember to run:
% latex bibtex latex latex
% to resolve all references
%
% ACM needs 'a single self-contained file'!
%
%APPENDICES are optional
%\balancecolumns
\appendix
%Appendix A
\section{Headings in Appendices}
The rules about hierarchical headings discussed above for
the body of the article are different in the appendices.
In the \textbf{appendix} environment, the command
\textbf{section} is used to
indicate the start of each Appendix, with alphabetic order
designation (i.e. the first is A, the second B, etc.) and
a title (if you include one). So, if you need
hierarchical structure
\textit{within} an Appendix, start with \textbf{subsection} as the
highest level. Here is an outline of the body of this
document in Appendix-appropriate form:
\subsection{Introduction}
\subsection{The Body of the Paper}
\subsubsection{Type Changes and Special Characters}
\subsubsection{Math Equations}
\paragraph{Inline (In-text) Equations}
\paragraph{Display Equations}
\subsubsection{Citations}
\subsubsection{Tables}
\subsubsection{Figures}
\subsubsection{Theorem-like Constructs}
\subsubsection*{A Caveat for the \TeX\ Expert}
\subsection{Conclusions}
\subsection{Acknowledgments}
\subsection{Additional Authors}
This section is inserted by \LaTeX; you do not insert it.
You just add the names and information in the
\texttt{{\char'134}additionalauthors} command at the start
of the document.
\subsection{References}
Generated by bibtex from your ~.bib file. Run latex,
then bibtex, then latex twice (to resolve references)
to create the ~.bbl file. Insert that ~.bbl file into
the .tex source file and comment out
the command \texttt{{\char'134}thebibliography}.
% This next section command marks the start of
% Appendix B, and does not continue the present hierarchy
\section{More Help for the Hardy}
The sig-alternate.cls file itself is chock-full of succinct
and helpful comments. If you consider yourself a moderately
experienced to expert user of \LaTeX, you may find reading
it useful but please remember not to change it.
%\balancecolumns % GM June 2007
% That's all folks!
\end{document}

BIN
pubform.doc Normal file

Binary file not shown.

1893
sig-alternate-05-2015.cls Normal file

File diff suppressed because it is too large

BIN
中文版-in.docx Normal file

Binary file not shown.