Compare commits

1 Commits
master ... fix

Author SHA1 Message Date
kartik-gupta-ij 327478758d High light shift bug fixed 2023-11-18 01:39:16 +05:30
11 changed files with 1036 additions and 1456 deletions

@@ -24,7 +24,6 @@ jobs:
cd $GITHUB_WORKSPACE
docker build -t qdrant/code-search-web:${{ github.sha }} .
docker save -o code-search-web.tar qdrant/code-search-web:${{ github.sha }}
chmod 666 code-search-web.tar
ls -al .
- name: copy data with ssh
uses: appleboy/scp-action@master

242
README.md
@@ -1,241 +1,9 @@
# Code search with Qdrant
# Semantic Search for code - demo
Developers need a code search tool that helps them find the right piece of code. In this README, we describe how
you can set up a tool that provides code results, in context.
## Overview
## Online version
ToDo
See our code search tool "in action." Navigate to
**[https://code-search.qdrant.tech/](https://code-search.qdrant.tech/)**. We've prepopulated the demo with the Qdrant
codebase. You can see the results, in context, even with relatively vague search terms.
## How to run
## Prerequisites
To run this demo on your own system, install and/or set up the following components:
- [Docker](https://www.docker.com/)
- [Docker Compose](https://docs.docker.com/compose/)
- [Rust](https://www.rust-lang.org/learn/get-started)
- [rust-analyzer](https://rust-analyzer.github.io/)
Docker and Docker Compose setup depends on your operating system. Please refer to the official documentation for
instructions on how to install them. Both Rust and rust-analyzer can be installed with the following commands:
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup component add rust-analyzer
```
## Description
You can set up [Qdrant](https://qdrant.tech) to help developers find the code they need, with context. Using semantic
search, developers can find the code samples that can help them do their day-to-day work, even with:
- Imprecise keywords
- Inexact names for functions, classes or variables
- Some other code snippets
The demo uses [Qdrant source code](https://github.com/qdrant/qdrant) to build an end-to-end code search application that
helps you find the right piece of code, even if you have never contributed to the project. We implemented an end-to-end
process, including data chunking, indexing, and search. Code search is a very specific task in which the programming
language syntax matters as much as the function, class, variable names, and the docstring, describing what and why.
While the latter is more of a traditional natural language processing task, the former requires a specific approach.
Thus, we use the following neural encoders for our use cases:
- `all-MiniLM-L6-v2` - one of the gold standard models for natural language processing
- `microsoft/unixcoder-base` - a model trained specifically on a code dataset
### Chunking and indexing process
Semantic search works best with _structured_ source code repositories, with good syntax, as well as best practices
as defined by the authoring team. If your code base needs help, start by dividing the code into chunks. Each
chunk should correspond to a specific function, struct, enum, or any other code structure that might be considered as a whole.
Separate, model-specific logic extracts the most important parts of the code and converts them
into a format that the neural network can understand. Only then is the encoded representation indexed in the Qdrant
collection, along with a JSON structure describing that snippet as a payload.
To that end, we work with the following models. The combination is the "best of both worlds."
#### all-MiniLM-L6-v2
Before encoding, the code is divided into chunks. Contrary to traditional NLP tasks, each chunk contains not only
the definition of the function or class but also the context in which it appears. In code search, it's important
to know where the function is defined: in which module and in which file. This information is crucial for presenting the
results to the user in a meaningful way.
For example, the `upsert` function from one of Qdrant's modules would be represented as the following structure:
```json
{
"name": "upsert",
"signature": "fn upsert (& mut self , id : PointOffsetType , vector : SparseVector)",
"code_type": "Function",
"docstring": "= \" Upsert a vector into the inverted index.\"",
"line": 105,
"line_from": 104,
"line_to": 125,
"context": {
"module": "inverted_index",
"file_path": "lib/sparse/src/index/inverted_index/inverted_index_ram.rs",
"file_name": "inverted_index_ram.rs",
"struct_name": "InvertedIndexRam",
"snippet": " /// Upsert a vector into the inverted index.\n pub fn upsert(&mut self, id: PointOffsetType, vector: SparseVector) {\n for (dim_id, weight) in vector.indices.into_iter().zip(vector.values.into_iter()) {\n let dim_id = dim_id as usize;\n match self.postings.get_mut(dim_id) {\n Some(posting) => {\n // update existing posting list\n let posting_element = PostingElement::new(id, weight);\n posting.upsert(posting_element);\n }\n None => {\n // resize postings vector (fill gaps with empty posting lists)\n self.postings.resize_with(dim_id + 1, PostingList::default);\n // initialize new posting for dimension\n self.postings[dim_id] = PostingList::new_one(id, weight);\n }\n }\n }\n // given that there are no holes in the internal ids and that we are not deleting from the index\n // we can just use the id as a proxy the count\n self.vector_count = max(self.vector_count, id as usize);\n }\n"
}
}
```
> Please note that this project aims to create a search mechanism specifically for Qdrant source code written in Rust.
Thus, we built a small, separate [rust-parser project](https://github.com/qdrant/rust-parser) that converts the code into the
aforementioned JSON objects. It uses [Syn](https://docs.rs/syn/latest/syn/index.html) to read the syntax tree of the
codebase. If you want to replicate the project for a different programming language, you will need to build a similar
parser for that language. For example, Python has a similar library called [ast](https://docs.python.org/3/library/ast.html),
but the code may be parsed differently, so some adjustments might be required.
Since the `all-MiniLM-L6-v2` model is trained mostly for natural language tasks, it won't be able to understand the
code directly. For that reason, **we build a fake text-like representation of the structure that should be
understandable for the model**, or for its tokenizer, to be more specific. This representation doesn't contain the actual code,
only its important parts, such as the function name, its signature, the docstring, and more. All
the special, language-specific characters are removed to keep the names and signatures as clean as possible. Only that
representation is then passed to the model.
For example, the `upsert` function from the example above would be represented as:
```python
'Function upsert that does: = " Upsert a vector into the inverted index." defined as fn upsert mut self id Point Offset Type vector Sparse Vector in struct InvertedIndexRam in module inverted_index in file inverted_index_ram.rs'
```
In a properly structured codebase, both module and file names should carry some additional information about the
semantics of that piece of code. For example, the `upsert` function is defined in the `InvertedIndexRam` struct, which
is a part of the `inverted_index` module. This indicates that the function belongs to the inverted index implementation stored in
memory, which is not clear from the function name itself.
> If you want to see how the conversion is implemented in general, please check the `textify` function in the
`code_search.index.textifier` module.
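The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual `textify` implementation: the helper names `split_identifiers` and `textify_sketch` are hypothetical, and the real function may handle more fields and edge cases.

```python
import re


def split_identifiers(text: str) -> str:
    """Replace language-specific punctuation with spaces and split CamelCase words."""
    # Drop characters such as '&', ':', '(' and ')'.
    cleaned = re.sub(r"[^A-Za-z0-9_]+", " ", text)
    # Insert a space at every lower-to-upper boundary: "SparseVector" -> "Sparse Vector".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", cleaned)
    return " ".join(spaced.split())


def textify_sketch(chunk: dict) -> str:
    """Build a sentence-like description of a parsed code chunk (hypothetical helper)."""
    context = chunk["context"]
    parts = [f"{chunk['code_type']} {chunk['name']}"]
    if chunk.get("docstring"):
        parts.append(f"that does: {chunk['docstring']}")
    parts.append(f"defined as {split_identifiers(chunk['signature'])}")
    if context.get("struct_name"):
        parts.append(f"in struct {context['struct_name']}")
    parts.append(f"in module {context['module']}")
    parts.append(f"in file {context['file_name']}")
    return " ".join(parts)


# A trimmed-down version of the JSON structure shown earlier.
chunk = {
    "name": "upsert",
    "signature": "fn upsert (& mut self , id : PointOffsetType , vector : SparseVector)",
    "code_type": "Function",
    "docstring": '= " Upsert a vector into the inverted index."',
    "context": {
        "module": "inverted_index",
        "file_name": "inverted_index_ram.rs",
        "struct_name": "InvertedIndexRam",
    },
}
print(textify_sketch(chunk))
```

Splitting `PointOffsetType` into `Point Offset Type` gives the tokenizer ordinary words to work with, which is the point of removing the language-specific characters.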
#### microsoft/unixcoder-base
In this case, the model focuses specifically on the code snippets. We take the definitions along with the corresponding
docstrings and pass them to the model. Extracting all the definitions is not a trivial task, but there are various
Language Server Protocol (**LSP**) implementations that can help with that, and you should be able to [find one for
your programming language](https://microsoft.github.io/language-server-protocol/implementors/servers/). For Rust, we
used the [rust-analyzer](https://rust-analyzer.github.io/) that is capable of converting the codebase into the [LSIF
format](https://microsoft.github.io/language-server-protocol/specifications/lsif/0.4.0/specification/), which is a
universal, JSON-based format for code, regardless of the programming language.
The same `upsert` function from the example above would be represented in LSIF as multiple entries and won't contain
the definition itself but just the location, so we have to extract it from the source file on our own.
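Extracting a definition from its location can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the helper `extract_snippet` is hypothetical, and it assumes the location has already been reduced to zero-based, inclusive start and end line numbers.

```python
def extract_snippet(source: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line (zero-based, inclusive) of the source text."""
    lines = source.splitlines(keepends=True)
    return "".join(lines[start_line : end_line + 1])


# A made-up Rust file used only for this example.
source = (
    "struct Counter { value: u32 }\n"
    "\n"
    "impl Counter {\n"
    "    /// Increase the counter by one.\n"
    "    pub fn increment(&mut self) {\n"
    "        self.value += 1;\n"
    "    }\n"
    "}\n"
)
snippet = extract_snippet(source, start_line=3, end_line=6)
print(snippet)
```

Starting the range at the doc comment rather than at the `fn` keyword keeps the docstring together with the definition, which matches the snippets passed to the model.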
Even though the `microsoft/unixcoder-base` model does not officially support Rust, we found it to be working quite well
for the task. Obtaining the embeddings for the code snippets is quite straightforward, as we just send the code snippet
directly to the model:
```rust
/// Upsert a vector into the inverted index.
pub fn upsert(&mut self, id: PointOffsetType, vector: SparseVector) {
for (dim_id, weight) in vector.indices.into_iter().zip(vector.values.into_iter()) {
let dim_id = dim_id as usize;
match self.postings.get_mut(dim_id) {
Some(posting) => {
// update existing posting list
let posting_element = PostingElement::new(id, weight);
posting.upsert(posting_element);
}
None => {
// resize postings vector (fill gaps with empty posting lists)
self.postings.resize_with(dim_id + 1, PostingList::default);
// initialize new posting for dimension
self.postings[dim_id] = PostingList::new_one(id, weight);
}
}
}
// given that there are no holes in the internal ids and that we are not deleting from the index
// we can just use the id as a proxy the count
self.vector_count = max(self.vector_count, id as usize);
}
```
Having both encoders should help us build a more robust search mechanism that can handle both natural language and
code-specific queries.
### Search process
The search process is quite straightforward. The user input is passed to both encoders, and the resulting vectors are
used to query both Qdrant collections at the same time. The results are then merged, with duplicates removed, and returned
to the user.
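The merge step could look like the sketch below. It is only an illustration: the function name `merge_results` is hypothetical, the file paths are made up, and identifying duplicates by file path and line range is an assumption, since the README does not say how duplicates are detected.

```python
from typing import Iterable


def merge_results(*result_lists: Iterable[dict]) -> list[dict]:
    """Concatenate result lists, keeping the first hit for each code location."""
    seen = set()
    merged = []
    for results in result_lists:
        for hit in results:
            # Assumption: two hits are duplicates when they point at the same lines.
            key = (hit["file_path"], hit["line_from"], hit["line_to"])
            if key in seen:
                continue
            seen.add(key)
            merged.append(hit)
    return merged


# Made-up hits standing in for the payloads returned by the two collections.
nlu_hits = [
    {"file_path": "lib/a.rs", "line_from": 10, "line_to": 20, "name": "upsert"},
    {"file_path": "lib/b.rs", "line_from": 1, "line_to": 5, "name": "search"},
]
code_hits = [
    {"file_path": "lib/a.rs", "line_from": 10, "line_to": 20, "name": "upsert"},
    {"file_path": "lib/c.rs", "line_from": 30, "line_to": 42, "name": "remove"},
]
merged = merge_results(nlu_hits, code_hits)
print([hit["name"] for hit in merged])
```

This version keeps the order in which hits arrive. A real service might instead re-rank the merged list, but similarity scores from two different encoders are not directly comparable, so that choice needs care.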
## Architecture
The demo uses the [FastAPI](https://fastapi.tiangolo.com/) framework for the backend and [React](https://reactjs.org/) for
the frontend layer.
![Architecture of the code search demo](images/architecture-diagram.png)
The demo consists of the following components:
- [React frontend](/frontend) - a web application that allows the user to search over the Qdrant codebase
- [FastAPI backend](/code_search/service.py) - a backend that communicates with Qdrant and exposes a REST API
- [Qdrant](https://qdrant.tech/) - a vector search engine that stores the data and performs the search
- Two neural encoders - one trained on natural language and one for code-specific tasks
There is also an additional indexing component that has to be run periodically to keep the index up to date. It is also
part of the demo, but it is not directly exposed to the user. All the required scripts are documented below, and you can
find them in the [`tools`](/tools) directory.
The demo is, as always, open source, so feel free to check the code in this repository to see how it is implemented.
## Usage
Like every other semantic search system, the demo requires a few setup steps. First of all, the data has to be
ingested, so we can then use the created index for our queries.
### Data indexing
Qdrant is used as a search engine, so you will need to have it running somewhere. You can either use the local container
or the Cloud version. If you want to use the local version, you can start it with the following command:
```shell
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
```
However, the easiest way to start using Qdrant is to use our Cloud version. You can sign up for a free tier 1GB cluster
at [https://cloud.qdrant.io/](https://cloud.qdrant.io/).
Once the environment is set up, you can configure the Qdrant instance and build the index by running the following
commands:
```shell
export QDRANT_URL="http://localhost:6333"
# For the Cloud service you need to specify the api key as well
# export QDRANT_API_KEY="your-api-key"
bash tools/download_and_index.sh
```
The indexing process might take a while, as it needs to encode all the code snippets and send them to Qdrant.
### Search service
Once the index is built, you can start the search service by running the following commands:
```shell
docker-compose up
```
The UI will be available at [http://localhost:8000/](http://localhost:8000/). This is what it should look like:
![Code search with Qdrant](images/code-search-ui.png)
You can type in the search query and see the related code structures. Queries can be written in natural language
or taken from the code itself.
## Further steps
If you would like to take the demo further, you can try to:
1. Disable one of the neural encoders and see how the search results change.
2. Try out some other encoder models and see the impact on the search quality.
3. Fork the project and support programming languages other than Rust.
4. Build a ground truth dataset and evaluate the search quality.
ToDo

@@ -74,24 +74,19 @@ def encode_and_upload():
vectors_config=rest.VectorParams(
size=len(embeddings[1]),
distance=rest.Distance.COSINE,
on_disk=True,
),
quantization_config=rest.ScalarQuantization(
scalar=rest.ScalarQuantizationConfig(
type=rest.ScalarType.INT8,
always_ram=True,
quantile=0.99,
)
)
)
print(f"Storing data in the collection {collection_name}")
client.upload_collection(
response = client.upsert(
collection_name=collection_name,
points=rest.Batch(
ids=[i for i, _ in enumerate(embeddings)],
vectors=embeddings,
payload=payloads,
payloads=payloads,
),
)
print(response)
if __name__ == '__main__':

@@ -2,8 +2,8 @@ import json
from pathlib import Path
import tqdm
from qdrant_client import QdrantClient, models
from qdrant_client.models import Distance, VectorParams
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from sentence_transformers import SentenceTransformer
from code_search.config import DATA_DIR, QDRANT_URL, QDRANT_API_KEY, QDRANT_NLU_COLLECTION_NAME, ENCODER_NAME, \
@@ -54,14 +54,6 @@ def upload():
vectors_config=VectorParams(
size=ENCODER_SIZE,
distance=Distance.COSINE,
on_disk=True,
),
quantization_config=models.ScalarQuantization(
scalar=models.ScalarQuantizationConfig(
type=models.ScalarType.INT8,
always_ram=True,
quantile=0.99,
)
)
)

File diff suppressed because one or more lines are too long

Image size: 26 KiB before, 9.5 KiB after

@@ -25,7 +25,7 @@ export function CustomHeader() {
variant="subtle"
className={classes.link}
component="a"
href="https://qdrant.tech/documentation/tutorials/code-search/"
href="https://github.com/qdrant/demo-code-search/blob/master/README.md"
target="_blank"
rel="noopener noreferrer"
>
@@ -64,7 +64,7 @@ export function CustomHeader() {
<Title className={classes.modalHeader}>
How does{" "}
<Text component="span" className={classes.highlight} inherit>
Code search
Semantic search
</Text>{" "}
work?
</Title>
@@ -80,16 +80,23 @@ export function CustomHeader() {
}}
>
<Text size="lg" color="dimmed" className={classes.description}>
When you search a codebase, you might have the following objectives:
To find code snippets similar to what you're using, or to identify a method
that does <b>this specific thing</b>. Our code search demo supports
both cases with multiple embedding models.
The search page will allow users to search for code snippets using
natural language. The text input will be converted into a vector
representation using advanced machine learning techniques. This
vector will then be used to semantically search a code snippet
database, retrieving similar code based on its meaning and
functionality.
</Text>
<Image src="/workflow.svg" />
<Text size="lg" color="dimmed" className={classes.description}>
Using both embeddings helps us find not only the relevant method but also the
exact piece of code inside it. Semantic code intelligence in action, in context!
The search results will display code snippets that are most relevant
to the user's query, ranked by their similarity to the input text.
Users can view and compare the retrieved code snippets to find the
one that best suits their needs. This approach to code search aims
to improve the efficiency and accuracy of finding relevant code by
leveraging advanced natural language processing and machine learning
algorithms.
</Text>
<Button
className={classes.modalBtnInner}

Binary file not shown.

Size before: 267 KiB

Binary file not shown.

Size before: 125 KiB

2127
poetry.lock generated

File diff suppressed because it is too large

@@ -10,11 +10,11 @@ fastapi = "^0.75.2"
uvicorn = "^0.17.6"
torch = "^2.1.0"
transformers = "^4.26.1"
qdrant-client = "^1.12.0"
qdrant-client = "^1.6.3"
python-dotenv = "^0.21.1"
sentence-transformers = "^2.2.2"
tqdm = "^4.64.1"
numpy = "^1.26.4,<2"
[tool.poetry.dev-dependencies]

@@ -4,8 +4,7 @@ torchvision==0.16.0+cpu
fastapi==0.75.2 ; python_version >= "3.9" and python_full_version < "3.11"
uvicorn==0.17.6 ; python_version >= "3.9" and python_full_version < "3.11"
transformers==4.26.1 ; python_version >= "3.9" and python_full_version < "3.11"
qdrant-client==1.12.0 ; python_version >= "3.9" and python_full_version < "3.11"
qdrant-client==1.6.3 ; python_version >= "3.9" and python_full_version < "3.11"
python-dotenv==0.21.1 ; python_version >= "3.9" and python_full_version < "3.11"
sentence-transformers==2.2.2 ; python_version >= "3.9" and python_full_version < "3.11"
tqdm==4.64.1 ; python_version >= "3.9" and python_full_version < "3.11"
numpy==1.26.4 ; python_version >= "3.9" and python_full_version < "3.11"