## Description
[Qdrant](https://qdrant.tech) is not only created to support engineers in building reliable and efficient semantic
search experiences, but it may also be used to ease their day-to-day work. Nowadays, the amount of code required to
support a single project is enormous, and it is growing every day. It is hard to keep track of all the code, and it is
even harder to find the right piece of code when you need it. **Keywords do not always work, and sometimes, you need to
find a piece of code that does a specific thing, but you do not remember the exact name of the function or class.**
That sounds like a perfect task for a semantic search engine, and Qdrant is here to help.

The demo uses the [Qdrant source code](https://github.com/qdrant/qdrant) to build an end-to-end code search application
that helps you find the right piece of code, even if you have never contributed to the project. We implemented an
end-to-end process, including data chunking, indexing, and search. Code search is a very specific task in which the
programming language syntax matters as much as the function, class, and variable names, and the docstring describing
what the code does and why. While the latter is more of a traditional natural language processing task, the former
requires a specific approach. Thus, we used two separate neural encoders to handle both cases.
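At query time, the idea is to encode the user's question and ask Qdrant for the nearest code chunks. A minimal sketch
with the Python `qdrant-client`, using the natural-language encoder described below (the collection name and the named
vector key are assumptions made for illustration, not necessarily the demo's configuration):

```python
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

client = QdrantClient("http://localhost:6333")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "how to insert a vector into the inverted index"

# Search against the "text" named vector; the code-specific vector can be queried
# the same way with the second encoder.
hits = client.search(
    collection_name="code-snippets",  # assumed collection name
    query_vector=models.NamedVector(
        name="text",  # assumed named-vector key
        vector=text_encoder.encode(query).tolist(),
    ),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload)
```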
### Chunking and indexing process
Source code might be considered semi-structured data, as it has an inherent structure defined by the programming
language syntax and by the good practices of the team that created it. If your codebase doesn't seem to have any
structure, it might be a good idea to start with some refactoring before building a search mechanism on top of it.
However, that structure already gives us a hint about how to approach the problem. We can divide the code into chunks,
with each chunk being a specific function, struct, enum, or any other code structure that can be treated as a whole.

There is separate, model-specific logic that extracts the most important parts of the code and converts them into a
format that the neural network can understand. Only then is the encoded representation indexed in the Qdrant
collection, along with a JSON structure describing the snippet as a payload.

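As a rough sketch of that last step with the Python `qdrant-client`, assuming the embeddings and the JSON description
of a chunk have already been produced (the collection name, named-vector keys, and payload fields are illustrative
assumptions, not the demo's exact configuration):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

# One named vector per encoder: a natural-language view and a code view of every chunk.
client.recreate_collection(
    collection_name="code-snippets",
    vectors_config={
        "text": models.VectorParams(size=384, distance=models.Distance.COSINE),  # all-MiniLM-L6-v2
        "code": models.VectorParams(size=768, distance=models.Distance.COSINE),  # unixcoder-base
    },
)

# Placeholder embeddings and payload; in the demo these come from the encoders and the chunking step.
client.upsert(
    collection_name="code-snippets",
    points=[
        models.PointStruct(
            id=1,
            vector={"text": [0.0] * 384, "code": [0.0] * 768},
            payload={"name": "upsert", "module": "inverted_index", "file_name": "inverted_index_ram.rs"},
        )
    ],
)
```
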
#### all-MiniLM-L6-v2
Before the encoding, the code is divided into chunks, but contrary to traditional NLP challenges, each chunk contains
not only the definition of the function or class but also the context in which it appears. When doing code search, it
is important to know where the function is defined, in which module, and in which file. This information is crucial
for presenting the results to the user in a meaningful way.

For example, the `upsert` function from one of Qdrant's modules is represented as a JSON structure that keeps the
function definition together with its context: the module, the file path, and the struct in which it is defined.
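A minimal sketch of what such a chunk could look like (the field names and values below are illustrative assumptions,
not the demo's exact schema):

```json
{
  "name": "upsert",
  "code_type": "Function",
  "docstring": "...",
  "line_from": 85,
  "line_to": 105,
  "context": {
    "module": "inverted_index",
    "struct_name": "InvertedIndexRam",
    "file_name": "inverted_index_ram.rs",
    "file_path": "lib/sparse/src/index/inverted_index/inverted_index_ram.rs",
    "snippet": "..."
  }
}
```
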
In a properly structured codebase, both module and file names should carry some additional information about the
semantics of that piece of code. For example, the `upsert` function is defined in the `InvertedIndexRam` struct, which
is a part of `inverted_index`. That tells us it belongs to the in-memory implementation of the inverted index, which
is not obvious from the function name alone.

> If you want to see how the conversion is implemented in general, please check the `textify` function in the
> `code_search.index.textifier` module.

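To give a flavour of that conversion, here is a minimal sketch that flattens a chunk into natural language and encodes
it with `sentence-transformers` (the chunk fields and the exact wording of the template are assumptions; the real
logic lives in the `textify` function mentioned above):

```python
from sentence_transformers import SentenceTransformer

# Hypothetical chunk; field names and values are illustrative, not the demo's schema.
chunk = {
    "name": "upsert",
    "module": "inverted_index",
    "struct_name": "InvertedIndexRam",
    "file_name": "inverted_index_ram.rs",
    "docstring": "Upsert a vector into the inverted index.",
}

# Flatten the structure into a plain-English sentence, roughly what a textifier could produce.
text = (
    f"Function {chunk['name']} defined in struct {chunk['struct_name']} "
    f"in module {chunk['module']} in file {chunk['file_name']} "
    f"that does: {chunk['docstring']}"
)

# all-MiniLM-L6-v2 is a general-purpose text encoder, used here for the natural-language view of the chunk.
model = SentenceTransformer("all-MiniLM-L6-v2")
text_embedding = model.encode(text)
print(text_embedding.shape)  # (384,)
```
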
#### microsoft/unixcoder-base

The second encoder works directly on the source code, so the code structures have to be located and extracted first.
Language Server Protocol (**LSP**) implementations can help with that, and there is most likely [an implementation for
your programming language](https://microsoft.github.io/language-server-protocol/implementors/servers/). For Rust, we
used [rust-analyzer](https://rust-analyzer.github.io/), which is capable of converting the codebase into the [LSIF
format](https://microsoft.github.io/language-server-protocol/specifications/lsif/0.4.0/specification/), a universal,
JSON-based format for code, regardless of the programming language.

The same `upsert` function from the example above would be represented in LSIF as multiple entries that do not contain
the definition itself, only its location, so we have to extract it from the source file on our own.

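A minimal sketch of that extraction step, assuming the payload already stores the file path and the line range
reported by LSIF (the field names, path, and line numbers below are hypothetical):

```python
from pathlib import Path


def extract_snippet(repo_root: str, file_path: str, line_from: int, line_to: int) -> str:
    """Read a definition back from the source file, given a location reported by LSIF.

    Line numbers are assumed to be 1-based and inclusive.
    """
    lines = Path(repo_root, file_path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[line_from - 1:line_to])


# Hypothetical location of the `upsert` definition.
snippet = extract_snippet(
    repo_root="qdrant",
    file_path="lib/sparse/src/index/inverted_index/inverted_index_ram.rs",
    line_from=85,
    line_to=105,
)
print(snippet)
```
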
Even though the `microsoft/unixcoder-base` model does not officially support Rust, we found it to work quite well for
the task. Obtaining the embeddings for the code snippets is quite straightforward, as we just send each code snippet
to the model.

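A sketch of how that could look with the Hugging Face `transformers` library (the Rust snippet and the mean-pooling
step are assumptions; the demo may post-process the model outputs differently):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")

# An illustrative code snippet; in the demo it comes from the extraction step above.
snippet = "pub fn upsert(&mut self, id: u32, vector: SparseVector) { /* ... */ }"

# Tokenize the raw code and run it through the encoder.
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single vector for the whole snippet.
mask = inputs["attention_mask"].unsqueeze(-1)
code_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(code_embedding.shape)  # torch.Size([1, 768])
```
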
There is also an additional indexing component that has to be run periodically to keep the collection in sync with the
codebase. It is part of the demo, but it is not directly exposed to the user. All the required scripts are documented
below, and you can find them in the [`tools`](/tools) directory.

The demo is, as always, open source, so feel free to check the code in this repository to see how it is implemented.
## Usage