init

2016-09-01 20:35:31 +08:00 · 2016-09-01 20:35:31 +08:00 · cb01ec2ffb
commit cb01ec2ffb
7 changed files with 2374 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,12 @@
+*.tmp
+~*
+*.tmp
+~*
+*.log
+*.aux
+*.pdf
+*.gz*
+*.sty
+*.eps
+*.bib
+.cls
--- a/Search.docx
+++ b/Search.docx
--- a/Internetware2016-reg.doc
+++ b/Internetware2016-reg.doc
--- a/camera-ready.tex
+++ b/camera-ready.tex
@@ -0,0 +1,469 @@
+
+
+\documentclass{sig-alternate-05-2015}
+
+\usepackage{enumitem}
+
+\begin{document}
+
+\CopyrightYear{2016} 
+\setcopyright{acmcopyright}
+\conferenceinfo{Internetware '16,}{September 18 2016, Beijing, China}
+\isbn{978-1-4503-4829-4/16/09}\acmPrice{\$15.00}
+\doi{http://dx.doi.org/10.1145/2993717.2993723}
+
+
+\title{Query Reformulation by Leveraging Crowd Wisdom for Scenario-based Software Search}
+
+
+\numberofauthors{5} 
+\author{
+% 1st. author
+\alignauthor
+Zhixing Li\\
+       \affaddr{National University of Defense Technology}\\
+       \affaddr{Changsha, China, 410073}\\
+       \email{starleelzx@163.com}
+% 2nd. author
+\alignauthor
+Tao Wang\\
+       \affaddr{National University of Defense Technology}\\
+       \affaddr{Changsha, China, 410073}\\
+       \email{starleelzx@163.com}
+% 3rd. author
+\alignauthor
+Yang Zhang\\
+       \affaddr{National University of Defense Technology}\\
+       \affaddr{Changsha, China, 410073}\\
+       \email{starleelzx@163.com}
+\and
+% 4th. author
+\alignauthor
+Yun Zhan\\
+       \affaddr{National University of Defense Technology}\\
+       \affaddr{Changsha, China, 410073}\\
+       \email{cloud\_zhan@163.com}
+% 5th. author
+\alignauthor
+Gang Yin\\
+       \affaddr{National University of Defense Technology}\\
+       \affaddr{Changsha, China, 410073}\\
+       \email{starleelzx@163.com}
+}
+
+\maketitle
+\begin{abstract}
+The Internet-scale production of open source software (OSS) in various communities is generating abundant reusable resources for software developers. However, retrieving and reusing the desired, mature software from a huge number of candidates is a great challenge: there are usually large gaps between the user's application context (which is often used as the query) and the OSS keywords (which are often used to match the query). In this paper, we define the scenario-based query problem for OSS retrieval and propose a novel approach that reformulates the raw query by leveraging the crowd wisdom of millions of developers to improve the retrieval results. We build a software-specific domain lexical database based on the knowledge in open source communities, with which we expand and optimize the input queries. The experimental results show that our approach reformulates the initial query effectively and significantly outperforms existing search engines at finding mature software.
+\end{abstract}
+
+
+\begin{CCSXML}
+<ccs2012>
+ <concept>
+  <concept_id>10010520.10010553.10010562</concept_id>
+  <concept_desc>Computer systems organization~Embedded systems</concept_desc>
+  <concept_significance>500</concept_significance>
+ </concept>
+ <concept>
+  <concept_id>10010520.10010575.10010755</concept_id>
+  <concept_desc>Computer systems organization~Redundancy</concept_desc>
+  <concept_significance>300</concept_significance>
+ </concept>
+ <concept>
+  <concept_id>10010520.10010553.10010554</concept_id>
+  <concept_desc>Computer systems organization~Robotics</concept_desc>
+  <concept_significance>100</concept_significance>
+ </concept>
+ <concept>
+  <concept_id>10003033.10003083.10003095</concept_id>
+  <concept_desc>Networks~Network reliability</concept_desc>
+  <concept_significance>100</concept_significance>
+ </concept>
+</ccs2012>  
+\end{CCSXML}
+
+\ccsdesc[500]{Computer systems organization~Embedded systems}
+\ccsdesc[300]{Computer systems organization~Redundancy}
+\ccsdesc{Computer systems organization~Robotics}
+\ccsdesc[100]{Networks~Network reliability}
+
+
+%
+% End generated code
+%
+
+%
+%  Use this command to print the description
+%
+\printccsdesc
+
+% We no longer use \terms command
+%\terms{Theory}
+
+\keywords{software retrieval; crowd wisdom; query reformulation.}
+
+\section{Introduction}
+Software reuse plays a very important role in improving software development quality and efficiency. With the rapid development of the open source movement, huge amounts of open source software have been published on the Internet [1]. For example, there are more than 460 thousand projects on SourceForge and more than 30 million repositories on GitHub, and the number of projects in these communities continues to grow dramatically every day. On the one hand, this provides abundant reusable resources [2],[3]; on the other hand, it makes locating the desired project among so many candidates a great challenge.
+
+To help developers perform such tasks, many project hosting sites, such as SourceForge (sourceforge.net) and GitHub (github.com), provide search services for open source software, and users can launch queries against the software datasets these sites have indexed. General search engines, such as Google and Bing, are alternative choices because of their powerful query processing ability.
+
+However, neither kind of service suits scenarios in which users know only their functional requirements or application context, especially novices who lack development experience and programming skills, or experienced developers who are stepping into a new domain. For example, they may search for ``android database'' when they actually want to persist data for an Android app, or ``python orm'' when they are programming in Python and want an ORM engine to replace SQL statements. We call this kind of query a scenario-based query. Scenario-based queries are usually short and widely used. Project hosting sites usually match queries against the text contained in software metadata such as the title and description, but this strategy cannot match the user's intent perfectly [4]. Results returned by a general search engine cover a wide range of resources and usually require additional clicks and time to filter out worthless information [5].
+
+To solve the problem of scenario-based queries, we introduce a novel method that takes advantage of crowd wisdom to reformulate the initial query. Approaches to query reformulation fall into two types: global methods and local methods [4], [13]. Global methods work fully automatically: they find new terms that are related to the query terms, independently of the results returned for the query. Local methods make use of the documents that initially appear to match the query and usually rely on user feedback or pseudo feedback to mark the top documents as relevant or not relevant. The marked documents are then used to adjust the initial query. However, relevance feedback has been little used in web search, and most users tend to perform their searches without any further interaction [13],[25].
+
+In this paper, we implement the global approach using domain knowledge obtained from a lexical database of software development, which we constructed from the crowd wisdom of millions of developers on StackOverflow (stackoverflow.com). We first crawl all the tags that users have created collaboratively on StackOverflow and then build the domain knowledge from those tags, in which tag attributes such as count and co-occurrence play an important role. Given a query, after standard preprocessing, our method performs synonym substitution on each term in the initial query; for example, ``db'' is transformed into ``database''. Next, the query is expanded using the related terms obtained from the lexical database. Finally, the expanded queries are refined by a ranking model and used to search the project dataset. In addition, we conducted an empirical evaluation using 14 search scenarios with 35 volunteer developers. We combine the measures Precision at $k$ and Mean Average Precision and use MAP@10 to measure the relevance performance of our method and of other search services [13]. We also conduct a user study to assess how well our method helps users find mature software.
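+
+For reference, one common formulation of these measures is sketched below; it is an assumption for illustration and may differ in detail from the variant used in our evaluation. Let $Q$ be the set of evaluated queries and let $\mathit{rel}_q(i)\in\{0,1\}$ indicate whether the result ranked at position $i$ for query $q$ is judged relevant (both symbols are introduced only for this sketch). Then
+\begin{displaymath}
+\mathrm{P@}k(q)=\frac{1}{k}\sum_{i=1}^{k}\mathit{rel}_q(i),
+\end{displaymath}
+\begin{displaymath}
+\mathrm{AP@}k(q)=\frac{\sum_{i=1}^{k}\mathrm{P@}i(q)\,\mathit{rel}_q(i)}{\max\left(1,\sum_{i=1}^{k}\mathit{rel}_q(i)\right)},
+\qquad
+\mathrm{MAP@}k=\frac{1}{|Q|}\sum_{q\in Q}\mathrm{AP@}k(q),
+\end{displaymath}
+and MAP@10 corresponds to $k=10$. For example, if the results at ranks 1 and 3 are relevant and the other eight are not, then $\mathrm{AP@}10=(1/1+2/3)/2=5/6\approx 0.83$.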
+
+In summary, our main contributions in this paper include:
+
+\begin{enumerate}[fullwidth,itemindent=1em,label=\arabic*)]\setlength{\itemsep}{0pt}
+\item We build a software-specific lexical database by leveraging crowd wisdom and effectively analyzing the domain knowledge in StackOverflow.
+\item We reformulate queries with the software-specific lexical database to capture the user's real query intent; this approach performs well on scenario-based queries.
+\item Our experimental results illustrate that our method can benefit software development by helping users find mature software more efficiently.
+\end{enumerate}
+
+The rest of the paper is organized as follows. Section 2 briefly reviews related work, and Section 3 explains related concepts. Section 4 describes our method in detail through a prototype design. Section 5 presents our empirical experiment and the evaluation of our method. Section 6 discusses threats to the validity of our method. Finally, Section 7 concludes the paper with future work.
+
+
+
+\section{Related Work}
+After reviewing a great deal of literature, we found that most studies in the area of open source software search focus on code search [16],[17],[18], while few researchers study the search of software project entities. Tegawende F. Bissyande et al. proposed an integrated search engine, Orion [6], which focuses on searching for project entities and provides a uniform interface with a declarative query language. They conclude that their system helps users find relevant projects faster and more accurately than traditional search engines. However, the restricted query language places an additional burden on users when they express their search intent, which reduces usability. Linstead [7] developed Sourcerer, which takes advantage not only of the textual aspects of software but also of its structural aspects and any relevant metadata. This system processes queries at the code level and is not suited to higher-level search, but it inspires us to view software projects from different perspectives.
+
+Studies [13],[19] list several techniques for reformulating a query, which can be classified into two types: global methods and local methods. Global methods are independent of the results returned for the query. Query expansion via a thesaurus is a widely used global method, in which the thesaurus can be generated automatically or manually. In this paper, we use an automatically generated thesaurus. Local methods refine a query according to the documents retrieved in the first round. Relevance feedback has been shown to be an effective local method for improving the relevance of results, but, as Manning stated, very few users use the relevance feedback option on the web [13]. Pseudo relevance feedback is another local method, which treats the top $k$ ranked items in the result list as relevant and then proceeds as relevance feedback does. The problem is that it may sometimes result in query drift. Reformulating queries via an automatically generated thesaurus is therefore more practical for software search.
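+
+To make thesaurus-based expansion concrete, a minimal sketch follows. The relatedness score and the cut-off $m$ are assumptions for illustration, not the exact weighting of our lexical database. Let $c(t)$ be the number of items labeled with tag $t$ and $c(t,t')$ the number of items labeled with both $t$ and $t'$. The relatedness of a candidate term $t'$ to a query term $t$ could be estimated as
+\begin{displaymath}
+r(t' \mid t)=\frac{c(t,t')}{c(t)},
+\end{displaymath}
+and the expanded query could be formed as
+\begin{displaymath}
+q' = q \cup \bigcup_{t\in q}\mathrm{top}_m(t),
+\end{displaymath}
+where $\mathrm{top}_m(t)$ denotes the $m$ terms with the highest $r(\cdot \mid t)$. For instance, with hypothetical counts $c(\mathit{android})=1000$ and $c(\mathit{android},\mathit{sqlite})=200$, we obtain $r(\mathit{sqlite}\mid\mathit{android})=0.2$.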
+
+Automatic query expansion (AQE) has been widely used to overcome the inaccuracy of information retrieval systems. Carpineto [8] presents a unified view of a large number of approaches to AQE that leverage various data sources and employ very different principles and techniques. Gao [20] represents search logs as a labeled directed graph and expands queries with a path-constrained random walk, in which the probability of selecting an expansion term for a query term is computed by a learned combination of constrained random walks on the graph. Lu [9] first identifies the part of speech of each term in the initial query and then finds its synonyms in WordNet [12]. Like [9], we use a corpus to expand the initial query; the difference is that our corpus is specific to the software domain and is built on open source software consumption communities, inspired by Yin and Wang [3], who stated that data from software consumption communities is very important for the evaluation of open source software.
+
+There are many available approaches to ranking software. OpenBRR [22] makes use of source code, documentation, and other data from the software development process to do this job. This method considers only the software itself and ignores its practical application. SourceForge and OpenHub rank software by its popularity. The limitation of their methods is that the results sometimes deviate from the actual situation, because all the user feedback they adopt comes from their own platforms. Fan [24] and Zhang [26] went further and used user feedback from consumption communities to assess and rank software. We share their view and think it is more reasonable to make the most of crowd wisdom.
+
+
+\subsection{Math Equations}
+You may want to display math equations in three distinct styles:
+inline, numbered or non-numbered display.  Each of
+the three is discussed in the next sections.
+
+\subsubsection{Inline (In-text) Equations}
+A formula that appears in the running text is called an
+inline or in-text formula.  It is produced by the
+\textbf{math} environment, which can be
+invoked with the usual \texttt{{\char'134}begin. . .{\char'134}end}
+construction or with the short form \texttt{\$. . .\$}. You
+can use any of the symbols and structures,
+from $\alpha$ to $\omega$, available in
+\LaTeX\cite{Lamport:LaTeX}; this section will simply show a
+few examples of in-text equations in context. Notice how
+this equation: \begin{math}\lim_{n\rightarrow \infty}x=0\end{math},
+set here in in-line math style, looks slightly different when
+set in display style.  (See next section).
+
+\subsubsection{Display Equations}
+A numbered display equation -- one set off by vertical space
+from the text and centered horizontally -- is produced
+by the \textbf{equation} environment. An unnumbered display
+equation is produced by the \textbf{displaymath} environment.
+
+Again, in either environment, you can use any of the symbols
+and structures available in \LaTeX; this section will just
+give a couple of examples of display equations in context.
+First, consider the equation, shown as an inline equation above:
+\begin{equation}\lim_{n\rightarrow \infty}x=0\end{equation}
+Notice how it is formatted somewhat differently in
+the \textbf{displaymath}
+environment.  Now, we'll enter an unnumbered equation:
+\begin{displaymath}\sum_{i=0}^{\infty} x + 1\end{displaymath}
+and follow it with another numbered equation:
+\begin{equation}\sum_{i=0}^{\infty}x_i=\int_{0}^{\pi+2} f\end{equation}
+just to demonstrate \LaTeX's able handling of numbering.
+
+\subsection{Citations}
+Citations to articles \cite{bowman:reasoning,
+clark:pct, braams:babel, herlihy:methodology},
+conference proceedings \cite{clark:pct} or
+books \cite{salas:calculus, Lamport:LaTeX} listed
+in the Bibliography section of your
+article will occur throughout the text of your article.
+You should use BibTeX to automatically produce this bibliography;
+you simply need to insert one of several citation commands with
+a key of the item cited in the proper location in
+the \texttt{.tex} file \cite{Lamport:LaTeX}.
+The key is a short reference you invent to uniquely
+identify each work; in this sample document, the key is
+the first author's surname and a
+word from the title.  This identifying key is included
+with each item in the \texttt{.bib} file for your article.
+
+The details of the construction of the \texttt{.bib} file
+are beyond the scope of this sample document, but more
+information can be found in the \textit{Author's Guide},
+and exhaustive details in the \textit{\LaTeX\ User's
+Guide}\cite{Lamport:LaTeX}.
+
+This article shows only the plainest form
+of the citation command, using \texttt{{\char'134}cite}.
+This is what is stipulated in the SIGS style specifications.
+No other citation format is endorsed or supported.
+
+\subsection{Tables}
+Because tables cannot be split across pages, the best
+placement for them is typically the top of the page
+nearest their initial cite.  To
+ensure this proper ``floating'' placement of tables, use the
+environment \textbf{table} to enclose the table's contents and
+the table caption.  The contents of the table itself must go
+in the \textbf{tabular} environment, to
+be aligned properly in rows and columns, with the desired
+horizontal and vertical rules.  Again, detailed instructions
+on \textbf{tabular} material
+are found in the \textit{\LaTeX\ User's Guide}.
+
+Immediately following this sentence is the point at which
+Table 1 is included in the input file; compare the
+placement of the table here with the table in the printed
+dvi output of this document.
+
+\begin{table}
+\centering
+\caption{Frequency of Special Characters}
+\begin{tabular}{|c|c|l|} \hline
+Non-English or Math&Frequency&Comments\\ \hline
+\O & 1 in 1,000& For Swedish names\\ \hline
+$\pi$ & 1 in 5& Common in math\\ \hline
+\$ & 4 in 5 & Used in business\\ \hline
+$\Psi^2_1$ & 1 in 40,000& Unexplained usage\\
+\hline\end{tabular}
+\end{table}
+
+To set a wider table, which takes up the whole width of
+the page's live area, use the environment
+\textbf{table*} to enclose the table's contents and
+the table caption.  As with a single-column table, this wide
+table will ``float'' to a location deemed more desirable.
+Immediately following this sentence is the point at which
+Table 2 is included in the input file; again, it is
+instructive to compare the placement of the
+table here with the table in the printed dvi
+output of this document.
+
+
+\begin{table*}
+\centering
+\caption{Some Typical Commands}
+\begin{tabular}{|c|c|l|} \hline
+Command&A Number&Comments\\ \hline
+\texttt{{\char'134}alignauthor} & 100& Author alignment\\ \hline
+\texttt{{\char'134}numberofauthors}& 200& Author enumeration\\ \hline
+\texttt{{\char'134}table}& 300 & For tables\\ \hline
+\texttt{{\char'134}table*}& 400& For wider tables\\ \hline\end{tabular}
+\end{table*}
+% end the environment with {table*}, NOTE not {table}!
+
+\subsection{Figures}
+Like tables, figures cannot be split across pages; the
+best placement for them
+is typically the top or the bottom of the page nearest
+their initial cite.  To ensure this proper ``floating'' placement
+of figures, use the environment
+\textbf{figure} to enclose the figure and its caption.
+
+This sample document contains examples of \textbf{.eps} files to be
+displayable with \LaTeX.  If you work with pdf\LaTeX, use files in the
+\textbf{.pdf} format.  Note that most modern \TeX\ systems will convert
+\textbf{.eps} to \textbf{.pdf} for you on the fly.  More details on
+each of these are found in the \textit{Author's Guide}.
+
+\begin{figure}
+\centering
+\includegraphics{fly}
+\caption{A sample black and white graphic.}
+\end{figure}
+
+\begin{figure}
+\centering
+\includegraphics[height=1in, width=1in]{fly}
+\caption{A sample black and white graphic
+that has been resized with the \texttt{includegraphics} command.}
+\end{figure}
+
+
+As was the case with tables, you may want a figure
+that spans two columns.  To do this, and still to
+ensure proper ``floating'' placement of figures, use the environment
+\textbf{figure*} to enclose the figure and its caption,
+and don't forget to end the environment with
+{figure*}, not {figure}!
+
+\begin{figure*}
+\centering
+\includegraphics{flies}
+\caption{A sample black and white graphic
+that needs to span two columns of text.}
+\end{figure*}
+
+
+\begin{figure}
+\centering
+\includegraphics[height=1in, width=1in]{rosette}
+\caption{A sample black and white graphic that has
+been resized with the \texttt{includegraphics} command.}
+\vskip -6pt
+\end{figure}
+
+\subsection{Theorem-like Constructs}
+Other common constructs that may occur in your article are
+the forms for logical constructs like theorems, axioms,
+corollaries and proofs.  There are
+two forms, one produced by the
+command \texttt{{\char'134}newtheorem} and the
+other by the command \texttt{{\char'134}newdef}; perhaps
+the clearest and easiest way to distinguish them is
+to compare the two in the output of this sample document:
+
+This uses the \textbf{theorem} environment, created by
+the\linebreak\texttt{{\char'134}newtheorem} command:
+\newtheorem{theorem}{Theorem}
+\begin{theorem}
+Let $f$ be continuous on $[a,b]$.  If $G$ is
+an antiderivative for $f$ on $[a,b]$, then
+\begin{displaymath}\int^b_af(t)dt = G(b) - G(a).\end{displaymath}
+\end{theorem}
+
+The other uses the \textbf{definition} environment, created
+by the \texttt{{\char'134}newdef} command:
+\newdef{definition}{Definition}
+\begin{definition}
+If $z$ is irrational, then by $e^z$ we mean the
+unique number which has
+logarithm $z$: \begin{displaymath}{\log e^z = z}\end{displaymath}
+\end{definition}
+
+Two lists of constructs that use one of these
+forms are given in the
+\textit{Author's  Guidelines}.
+ 
+There is one other similar construct environment, which is
+already set up
+for you; i.e. you must \textit{not} use
+a \texttt{{\char'134}newdef} command to
+create it: the \textbf{proof} environment.  Here
+is an example of its use:
+\begin{proof}
+Suppose on the contrary there exists a real number $L$ such that
+\begin{displaymath}
+\lim_{x\rightarrow\infty} \frac{f(x)}{g(x)} = L.
+\end{displaymath}
+Then
+\begin{displaymath}
+l=\lim_{x\rightarrow c} f(x)
+= \lim_{x\rightarrow c}
+\left[ g(x) \cdot \frac{f(x)}{g(x)} \right ]
+= \lim_{x\rightarrow c} g(x) \cdot \lim_{x\rightarrow c}
+\frac{f(x)}{g(x)} = 0\cdot L = 0,
+\end{displaymath}
+which contradicts our assumption that $l\neq 0$.
+\end{proof}
+
+Complete rules about using these environments and using the
+two different creation commands are in the
+\textit{Author's Guide}; please consult it for more
+detailed instructions.  If you need to use another construct,
+not listed therein, which you want to have the same
+formatting as the Theorem
+or the Definition\cite{salas:calculus} shown above,
+use the \texttt{{\char'134}newtheorem} or the
+\texttt{{\char'134}newdef} command,
+respectively, to create it.
+
+\subsection*{A {\secit Caveat} for the \TeX\ Expert}
+Because you have just been given permission to
+use the \texttt{{\char'134}newdef} command to create a
+new form, you might think you can
+use \TeX's \texttt{{\char'134}def} to create a
+new command: \textit{Please refrain from doing this!}
+Remember that your \LaTeX\ source code is primarily intended
+to create camera-ready copy, but may be converted
+to other forms -- e.g. HTML. If you inadvertently omit
+some or all of the \texttt{{\char'134}def}s recompilation will
+be, to say the least, problematic.
+
+\section{Conclusions}
+This paragraph will end the body of this sample document.
+Remember that you might still have Acknowledgments or
+Appendices; brief samples of these
+follow.  There is still the Bibliography to deal with; and
+we will make a disclaimer about that here: with the exception
+of the reference to the \LaTeX\ book, the citations in
+this paper are to articles which have nothing to
+do with the present subject and are used as
+examples only.
+%\end{document}  % This is where a 'short' article might terminate
+
+%ACKNOWLEDGMENTS are optional
+\section{Acknowledgments}
+This section is optional; it is a location for you
+to acknowledge grants, funding, editing assistance and
+what have you.  In the present case, for example, the
+authors would like to thank Gerald Murray of ACM for
+his help in codifying this \textit{Author's Guide}
+and the \textbf{.cls} and \textbf{.tex} files that it describes.
+
+%
+% The following two commands are all you need in the
+% initial runs of your .tex file to
+% produce the bibliography for the citations in your paper.
+\bibliographystyle{abbrv}
+\bibliography{sigproc}  % sigproc.bib is the name of the Bibliography in this case
+% You must have a proper ".bib" file
+%  and remember to run:
+% latex bibtex latex latex
+% to resolve all references
+%
+% ACM needs 'a single self-contained file'!
+%
+%APPENDICES are optional
+%\balancecolumns
+\appendix
+%Appendix A
+\section{Headings in Appendices}
+The rules about hierarchical headings discussed above for
+the body of the article are different in the appendices.
+In the \textbf{appendix} environment, the command
+\textbf{section} is used to
+indicate the start of each Appendix, with alphabetic order
+designation (i.e. the first is A, the second B, etc.) and
+a title (if you include one).  So, if you need
+hierarchical structure
+\textit{within} an Appendix, start with \textbf{subsection} as the
+highest level. Here is an outline of the body of this
+document in Appendix-appropriate form:
+\subsection{Introduction}
+\subsection{The Body of the Paper}
+\subsubsection{Type Changes and  Special Characters}
+\subsubsection{Math Equations}
+\paragraph{Inline (In-text) Equations}
+\paragraph{Display Equations}
+\subsubsection{Citations}
+\subsubsection{Tables}
+\subsubsection{Figures}
+\subsubsection{Theorem-like Constructs}
+\subsubsection*{A Caveat for the \TeX\ Expert}
+\subsection{Conclusions}
+\subsection{Acknowledgments}
+\subsection{Additional Authors}
+This section is inserted by \LaTeX; you do not insert it.
+You just add the names and information in the
+\texttt{{\char'134}additionalauthors} command at the start
+of the document.
+\subsection{References}
+Generated by bibtex from your ~.bib file.  Run latex,
+then bibtex, then latex twice (to resolve references)
+to create the ~.bbl file.  Insert that ~.bbl file into
+the .tex source file and comment out
+the command \texttt{{\char'134}thebibliography}.
+% This next section command marks the start of
+% Appendix B, and does not continue the present hierarchy
+\section{More Help for the Hardy}
+The sig-alternate.cls file itself is chock-full of succinct
+and helpful comments.  If you consider yourself a moderately
+experienced to expert user of \LaTeX, you may find reading
+it useful but please remember not to change it.
+%\balancecolumns % GM June 2007
+% That's all folks!
+\end{document}
--- a/pubform.doc
+++ b/pubform.doc
--- a/sig-alternate-05-2015.cls
+++ b/sig-alternate-05-2015.cls
--- a/中文版-in.docx
+++ b/中文版-in.docx