Compare commits

10 Commits

| SHA1 |
|---|
| e797e17099 |
| 7f286cf5ed |
| 5dbce33027 |
| 11e9551f85 |
| 5cb51d905b |
| 4d181ce7b0 |
| bd058ffee3 |
| 4fd807e627 |
| dddd775cd1 |
| 89fb97593d |
@@ -24,6 +24,7 @@ jobs:
           cd $GITHUB_WORKSPACE
           docker build -t qdrant/code-search-web:${{ github.sha }} .
           docker save -o code-search-web.tar qdrant/code-search-web:${{ github.sha }}
+          chmod 666 code-search-web.tar
           ls -al .
       - name: copy data with ssh
         uses: appleboy/scp-action@master
README.md (242 changed lines)
@@ -1,9 +1,241 @@
-# Semantic Search for code - demo
+# Code search with Qdrant
 
-ToDo
+## Overview
+
+Developers need a code search tool that helps them find the right piece of code. In this README, we describe how
+you can set up a tool that provides code search results, in context.
 
 ## Online version
 
-## How to run
-
-ToDo
+See our code search tool "in action." Navigate to
+**[https://code-search.qdrant.tech/](https://code-search.qdrant.tech/)**. We've prepopulated the demo with the Qdrant
+codebase. You can see the results, in context, even with relatively vague search terms.
+
+## Prerequisites
+
+To run this demo on your own system, install and/or set up the following components:
+
+- [Docker](https://www.docker.com/)
+- [Docker Compose](https://docs.docker.com/compose/)
+- [Rust](https://www.rust-lang.org/learn/get-started)
+- [rust-analyzer](https://rust-analyzer.github.io/)
+
+Docker and Docker Compose setup depends on your operating system. Please refer to the official documentation for
+instructions on how to install them. Both Rust and rust-analyzer can be installed with the following commands:
+
+```shell
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+rustup component add rust-analyzer
+```
+
+## Description
+
+You can set up [Qdrant](https://qdrant.tech) to help developers find the code they need, with context. Using semantic
+search, developers can find the code samples that help them do their day-to-day work, even with:
+
+- Imprecise keywords
+- Inexact names for functions, classes, or variables
+- Other code snippets
+
+The demo uses [Qdrant source code](https://github.com/qdrant/qdrant) to build an end-to-end code search application that
+helps you find the right piece of code, even if you have never contributed to the project. We implemented an end-to-end
+process, including data chunking, indexing, and search. Code search is a very specific task, in which the programming
+language syntax matters as much as the function, class, and variable names, and the docstring describing what the code
+does and why. While the latter is more of a traditional natural language processing task, the former requires a
+specific approach. Thus, we use the following neural encoders for our use cases:
+
+- `all-MiniLM-L6-v2` - one of the gold standard models for natural language processing
+- `microsoft/unixcoder-base` - a model trained specifically on a code dataset
+
+### Chunking and indexing process
+
+Semantic search works best with _structured_ source code repositories, with good syntax and the best practices defined
+by the authoring team. The first step is to divide the code into chunks: each chunk should correspond to a specific
+function, struct, enum, or any other code structure that can be considered as a whole.
+
+Separate, model-specific logic extracts the most important parts of the code and converts them into a format that the
+neural network can understand. Only then is the encoded representation indexed in the Qdrant collection, along with a
+JSON structure describing that snippet as a payload.
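For illustration, here is a minimal sketch of that indexing step with the Python `qdrant-client`. The collection name, chunk text, and payload below are made up; the demo's actual indexing logic lives in its tools and is described later in this README:

```python
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

# Hypothetical setup: a local Qdrant instance and the NLP encoder used by the demo.
client = QdrantClient(url="http://localhost:6333")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# all-MiniLM-L6-v2 produces 384-dimensional vectors.
client.create_collection(
    collection_name="code-snippets",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

# One chunk: a text-like representation of the code, plus a JSON payload describing it.
chunk_text = "Function upsert that does: Upsert a vector into the inverted index ..."
payload = {"name": "upsert", "code_type": "Function", "context": {"module": "inverted_index"}}

client.upsert(
    collection_name="code-snippets",
    points=[
        models.PointStruct(id=0, vector=encoder.encode(chunk_text).tolist(), payload=payload),
    ],
)
```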
+
+To that end, we work with the following models. The combination is the "best of both worlds."
+
+#### all-MiniLM-L6-v2
+
+Before the encoding, code is divided into chunks, but contrary to traditional NLP challenges, each chunk contains not
+only the definition of the function or class but also the context in which it appears. When searching code, it's
+important to know where the function is defined, in which module, and in which file. This information is crucial for
+presenting the results to the user in a meaningful way.
+
+For example, the `upsert` function from one of Qdrant's modules would be represented as the following structure:
+
+```json
+{
+    "name": "upsert",
+    "signature": "fn upsert (& mut self , id : PointOffsetType , vector : SparseVector)",
+    "code_type": "Function",
+    "docstring": "= \" Upsert a vector into the inverted index.\"",
+    "line": 105,
+    "line_from": 104,
+    "line_to": 125,
+    "context": {
+        "module": "inverted_index",
+        "file_path": "lib/sparse/src/index/inverted_index/inverted_index_ram.rs",
+        "file_name": "inverted_index_ram.rs",
+        "struct_name": "InvertedIndexRam",
+        "snippet": " /// Upsert a vector into the inverted index.\n pub fn upsert(&mut self, id: PointOffsetType, vector: SparseVector) {\n for (dim_id, weight) in vector.indices.into_iter().zip(vector.values.into_iter()) {\n let dim_id = dim_id as usize;\n match self.postings.get_mut(dim_id) {\n Some(posting) => {\n // update existing posting list\n let posting_element = PostingElement::new(id, weight);\n posting.upsert(posting_element);\n }\n None => {\n // resize postings vector (fill gaps with empty posting lists)\n self.postings.resize_with(dim_id + 1, PostingList::default);\n // initialize new posting for dimension\n self.postings[dim_id] = PostingList::new_one(id, weight);\n }\n }\n }\n // given that there are no holes in the internal ids and that we are not deleting from the index\n // we can just use the id as a proxy the count\n self.vector_count = max(self.vector_count, id as usize);\n }\n"
+    }
+}
+```
+
+> Please note that this project aims to create a search mechanism specifically for Qdrant source code written in Rust.
+Thus, we built a small, separate [rust-parser project](https://github.com/qdrant/rust-parser) that converts it into the
+aforementioned JSON objects. It uses [Syn](https://docs.rs/syn/latest/syn/index.html) to read the syntax tree of the
+codebase. If you want to replicate the project for a different programming language, you will need to build a similar
+parser for that language. For example, Python has a similar library called [ast](https://docs.python.org/3/library/ast.html),
+but there might be some differences in the way the code is parsed, so some adjustments might be required.
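As a hint of what the Python route could look like, here is a minimal sketch using the standard `ast` module to pull function definitions and docstrings out of a source string (illustrative only, not part of this project):

```python
import ast

source = '''
def upsert(index, point_id, vector):
    """Upsert a vector into the index."""
    index[point_id] = vector
'''

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        # Emit a chunk description similar to the JSON structure shown above.
        print({
            "name": node.name,
            "code_type": "Function",
            "docstring": ast.get_docstring(node),
            "line_from": node.lineno,
            "line_to": node.end_lineno,
        })
```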
+
+Since the `all-MiniLM-L6-v2` model is trained for natural language tasks, it won't be able to understand the code
+directly. For that reason, **we build a text-like representation of the structure that should be understandable to the
+model**, or to its tokenizer to be more specific. This representation won't contain the actual code, but rather its
+important parts: the function name, its signature, the docstring, and more. All the special, language-specific
+characters are removed to keep the names and signatures as clean as possible. Only that representation is then passed
+to the model.
+
+For example, the `upsert` function from the example above would be represented as:
+
+```python
+'Function upsert that does: = " Upsert a vector into the inverted index." defined as fn upsert mut self id Point Offset Type vector Sparse Vector in struct InvertedIndexRam in module inverted_index in file inverted_index_ram.rs'
+```
+
+In a properly structured codebase, both module and file names should carry some additional information about the
+semantics of that piece of code. For example, the `upsert` function is defined in the `InvertedIndexRam` struct, which
+is part of the `inverted_index` module, indicating that it belongs to the inverted index implementation stored in
+memory. None of this is clear from the function name alone.
+
+> If you want to see how the conversion is implemented in general, please check the `textify` function in the
+`code_search.index.textifier` module.
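A rough sketch of what such a conversion could look like, applied to the chunk structure shown earlier (a simplification, not the actual `textify` implementation):

```python
import re

def textify(chunk: dict) -> str:
    """Turn a structured code chunk into text an NLP encoder can digest.

    A simplified sketch; see code_search.index.textifier for the real logic.
    """
    # Split camelCase identifiers and strip language-specific punctuation.
    signature = re.sub(r"([a-z])([A-Z])", r"\1 \2", chunk["signature"])
    signature = " ".join(re.sub(r"[^\w\s]", " ", signature).split())

    context = chunk["context"]
    return (
        f"{chunk['code_type']} {chunk['name']} that does: {chunk['docstring']} "
        f"defined as {signature} in struct {context['struct_name']} "
        f"in module {context['module']} in file {context['file_name']}"
    )
```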
+
+#### microsoft/unixcoder-base
+
+This model focuses specifically on the code snippets. We take the definitions along with the corresponding docstrings
+and pass them to the model. Extracting all the definitions is not a trivial task, but there are various Language Server
+Protocol (**LSP**) implementations that can help with that, and you should be able to [find one for your programming
+language](https://microsoft.github.io/language-server-protocol/implementors/servers/). For Rust, we used
+[rust-analyzer](https://rust-analyzer.github.io/), which is capable of converting the codebase into the [LSIF
+format](https://microsoft.github.io/language-server-protocol/specifications/lsif/0.4.0/specification/), a universal,
+JSON-based format for code, regardless of the programming language.
+
+The same `upsert` function from the example above would be represented in LSIF as multiple entries that don't contain
+the definition itself, just its location, so we have to extract the code from the source file on our own.
+
+Even though the `microsoft/unixcoder-base` model does not officially support Rust, we found it to work quite well for
+the task. Obtaining the embeddings for the code snippets is straightforward, as we just send the code snippet directly
+to the model:
+
+```rust
+/// Upsert a vector into the inverted index.
+pub fn upsert(&mut self, id: PointOffsetType, vector: SparseVector) {
+    for (dim_id, weight) in vector.indices.into_iter().zip(vector.values.into_iter()) {
+        let dim_id = dim_id as usize;
+        match self.postings.get_mut(dim_id) {
+            Some(posting) => {
+                // update existing posting list
+                let posting_element = PostingElement::new(id, weight);
+                posting.upsert(posting_element);
+            }
+            None => {
+                // resize postings vector (fill gaps with empty posting lists)
+                self.postings.resize_with(dim_id + 1, PostingList::default);
+                // initialize new posting for dimension
+                self.postings[dim_id] = PostingList::new_one(id, weight);
+            }
+        }
+    }
+    // given that there are no holes in the internal ids and that we are not deleting from the index
+    // we can just use the id as a proxy the count
+    self.vector_count = max(self.vector_count, id as usize);
+}
+```
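For reference, embedding such a snippet with `microsoft/unixcoder-base` through Hugging Face `transformers` could look like the sketch below. Mean pooling over the token embeddings is an assumption here, not necessarily what the demo does:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")

snippet = "pub fn upsert(&mut self, id: PointOffsetType, vector: SparseVector) { /* ... */ }"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Collapse the per-token embeddings into a single vector for the whole snippet.
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)  # 768-dimensional vector
```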
+
+Having both encoders helps us build a more robust search mechanism that can handle both natural language and
+code-specific queries.
+
+### Search process
+
+The search process is quite straightforward. The user input is passed to both encoders, and the resulting vectors are
+used to query both Qdrant collections at the same time. The results are then merged, deduplicated, and returned to the
+user.
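A condensed sketch of that flow (collection names and the deduplication key are illustrative, and comparing raw scores across two different models is a simplification of whatever ranking the demo actually applies):

```python
def search(query, nlu_encoder, code_encoder, client, limit=5):
    """Query both collections and merge the results, dropping duplicates."""
    nlu_hits = client.search(
        collection_name="code-snippets-nlu",
        query_vector=nlu_encoder.encode(query).tolist(),
        limit=limit,
    )
    code_hits = client.search(
        collection_name="code-snippets-code",
        query_vector=code_encoder.encode(query).tolist(),
        limit=limit,
    )

    # Deduplicate by source location, keeping the highest-scored hit first.
    seen, merged = set(), []
    for hit in sorted(nlu_hits + code_hits, key=lambda h: h.score, reverse=True):
        key = (hit.payload["context"]["file_path"], hit.payload["line_from"])
        if key not in seen:
            seen.add(key)
            merged.append(hit)
    return merged
```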
+
+## Architecture
+
+The demo uses the [FastAPI](https://fastapi.tiangolo.com/) framework for the backend and [React](https://reactjs.org/)
+for the frontend layer.
+
+
+
+The demo consists of the following components:
+
+- [React frontend](/frontend) - a web application that allows the user to search over the Qdrant codebase
+- [FastAPI backend](/code_search/service.py) - a backend that communicates with Qdrant and exposes a REST API
+- [Qdrant](https://qdrant.tech/) - a vector search engine that stores the data and performs the search
+- Two neural encoders - one trained on natural language and one for code-specific tasks
+
+There is also an additional indexing component that has to be run periodically to keep the index up to date. It is
+part of the demo as well, but it is not directly exposed to the user. All the required scripts are documented below,
+and you can find them in the [`tools`](/tools) directory.
+
+The demo is, as always, open source, so feel free to check the code in this repository to see how it is implemented.
+
+## Usage
+
+Like every other semantic search system, the demo requires a few setup steps. First of all, the data has to be
+ingested, so we can then use the created index for our queries.
+
+### Data indexing
+
+Qdrant is used as a search engine, so you will need to have it running somewhere. You can use either a local container
+or the Cloud version. If you want to use the local version, you can start it with the following command:
+
+```shell
+docker run -p 6333:6333 -p 6334:6334 \
+    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
+    qdrant/qdrant
+```
+
+However, the easiest way to start using Qdrant is our Cloud version. You can sign up for a free-tier 1GB cluster
+at [https://cloud.qdrant.io/](https://cloud.qdrant.io/).
+
+Once the environment is set up, you can configure the Qdrant instance and build the index by running the following
+commands:
+
+```shell
+export QDRANT_URL="http://localhost:6333"
+
+# For the Cloud service you need to specify the API key as well
+# export QDRANT_API_KEY="your-api-key"
+
+bash tools/download_and_index.sh
+```
+
+The indexing process might take a while, as it needs to encode all the code snippets and send them to Qdrant.
+
+### Search service
+
+Once the index is built, you can start the search service by running the following command:
+
+```shell
+docker-compose up
+```
+
+The UI will be available at [http://localhost:8000/](http://localhost:8000/). This is how it should look:
+
+
+
+You can type in a search query and see the related code structures. Queries can come from natural language as well as
+from the code itself.
+
+## Further steps
+
+If you would like to take the demo further, you can try to:
+
+1. Disable one of the neural encoders and see how the search results change.
+2. Try out some other encoder models and see the impact on the search quality.
+3. Fork the project and support programming languages other than Rust.
+4. Build a ground truth dataset and evaluate the search quality.
@@ -74,19 +74,24 @@ def encode_and_upload():
         vectors_config=rest.VectorParams(
             size=len(embeddings[1]),
             distance=rest.Distance.COSINE,
+            on_disk=True,
         ),
+        quantization_config=rest.ScalarQuantization(
+            scalar=rest.ScalarQuantizationConfig(
+                type=rest.ScalarType.INT8,
+                always_ram=True,
+                quantile=0.99,
+            )
+        )
     )

     print(f"Storing data in the collection {collection_name}")
-    response = client.upsert(
+    client.upload_collection(
         collection_name=collection_name,
-        points=rest.Batch(
-            ids=[i for i, _ in enumerate(embeddings)],
-            vectors=embeddings,
-            payloads=payloads,
-        ),
+        ids=[i for i, _ in enumerate(embeddings)],
+        vectors=embeddings,
+        payload=payloads,
     )
-    print(response)


 if __name__ == '__main__':
@@ -2,8 +2,8 @@ import json
 from pathlib import Path

 import tqdm
-from qdrant_client import QdrantClient
-from qdrant_client.http.models import Distance, VectorParams
+from qdrant_client import QdrantClient, models
+from qdrant_client.models import Distance, VectorParams
 from sentence_transformers import SentenceTransformer

 from code_search.config import DATA_DIR, QDRANT_URL, QDRANT_API_KEY, QDRANT_NLU_COLLECTION_NAME, ENCODER_NAME, \
@@ -54,6 +54,14 @@ def upload():
         vectors_config=VectorParams(
             size=ENCODER_SIZE,
             distance=Distance.COSINE,
+            on_disk=True,
         ),
+        quantization_config=models.ScalarQuantization(
+            scalar=models.ScalarQuantizationConfig(
+                type=models.ScalarType.INT8,
+                always_ram=True,
+                quantile=0.99,
+            )
+        )
     )
File diff suppressed because one or more lines are too long
Image changed. Before: 9.5 KiB, after: 26 KiB.
@@ -25,7 +25,7 @@ export function CustomHeader() {
           variant="subtle"
           className={classes.link}
           component="a"
-          href="https://github.com/qdrant/demo-code-search/blob/master/README.md"
+          href="https://qdrant.tech/documentation/tutorials/code-search/"
           target="_blank"
           rel="noopener noreferrer"
         >
@@ -64,7 +64,7 @@ export function CustomHeader() {
           <Title className={classes.modalHeader}>
             How does{" "}
             <Text component="span" className={classes.highlight} inherit>
-              Semantic search
+              Code search
             </Text>{" "}
             work?
           </Title>
@@ -80,23 +80,16 @@ export function CustomHeader() {
             }}
           >
             <Text size="lg" color="dimmed" className={classes.description}>
-              The search page will allow users to search for code snippets using
-              natural language. The text input will be converted into a vector
-              representation using advanced machine learning techniques. This
-              vector will then be used to semantically search a code snippet
-              database, retrieving similar code based on its meaning and
-              functionality.
+              When you search a codebase, you might have the following objectives:
+              to find code snippets similar to what you're using, or to identify a method
+              that does <b>this specific thing</b>. Our code search demo supports
+              both cases with multiple embedding models.
             </Text>
             <Image src="/workflow.svg" />
             <Text size="lg" color="dimmed" className={classes.description}>
-              The search results will display code snippets that are most relevant
-              to the user's query, ranked by their similarity to the input text.
-              Users can view and compare the retrieved code snippets to find the
-              one that best suits their needs. This approach to code search aims
-              to improve the efficiency and accuracy of finding relevant code by
-              leveraging advanced natural language processing and machine learning
-              algorithms.
+              Using both embeddings helps us find not only the relevant method but also the
+              exact piece of code inside it. Semantic code intelligence in action, in context!
             </Text>
             <Button
               className={classes.modalBtnInner}
Binary file not shown. After: 267 KiB.
Binary file not shown. After: 125 KiB.
File diff suppressed because it is too large
@@ -10,11 +10,11 @@ fastapi = "^0.75.2"
 uvicorn = "^0.17.6"
 torch = "^2.1.0"
 transformers = "^4.26.1"
-qdrant-client = "^1.6.3"
+qdrant-client = "^1.12.0"
 python-dotenv = "^0.21.1"
 sentence-transformers = "^2.2.2"
 tqdm = "^4.64.1"

+numpy = "^1.26.4,<2"

 [tool.poetry.dev-dependencies]
@@ -4,7 +4,8 @@ torchvision==0.16.0+cpu
 fastapi==0.75.2 ; python_version >= "3.9" and python_full_version < "3.11"
 uvicorn==0.17.6 ; python_version >= "3.9" and python_full_version < "3.11"
 transformers==4.26.1 ; python_version >= "3.9" and python_full_version < "3.11"
-qdrant-client==1.6.3 ; python_version >= "3.9" and python_full_version < "3.11"
+qdrant-client==1.12.0 ; python_version >= "3.9" and python_full_version < "3.11"
 python-dotenv==0.21.1 ; python_version >= "3.9" and python_full_version < "3.11"
 sentence-transformers==2.2.2 ; python_version >= "3.9" and python_full_version < "3.11"
 tqdm==4.64.1 ; python_version >= "3.9" and python_full_version < "3.11"
+numpy==1.26.4 ; python_version >= "3.9" and python_full_version < "3.11"