Text Semantic Search

Semantic Search

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also surface documents that use synonyms or related phrasing.

The idea behind semantic search is to embed all entries in your corpus, whether sentences, paragraphs, or documents, into a vector space.

The query is embedded into the same vector space at search time, and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
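As a rough sketch of this idea (with made-up vectors standing in for real embeddings), the snippet below ranks a toy corpus by cosine similarity to a query embedding:

    import numpy as np

    def cosine_search(corpus_embeddings, query_embedding, top_k=3):
        # Normalize so that the dot product equals cosine similarity.
        corpus_norm = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
        query_norm = query_embedding / np.linalg.norm(query_embedding)
        scores = corpus_norm @ query_norm        # one similarity score per corpus entry
        best = np.argsort(-scores)[:top_k]       # indices of the closest embeddings
        return [(int(i), float(scores[i])) for i in best]

    # Toy 4-dimensional embeddings, purely for illustration.
    corpus = np.array([[0.9, 0.1, 0.0, 0.0],
                       [0.0, 0.8, 0.2, 0.0],
                       [0.1, 0.0, 0.9, 0.1]])
    query = np.array([0.85, 0.15, 0.0, 0.0])
    print(cosine_search(corpus, query, top_k=2))  # entry 0 ranks first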

Embeddings

In machine learning, an embedding is a representation of objects (such as words, items, or entities) in a lower-dimensional vector space. Embeddings are used to capture the inherent relationships and similarities between objects, making it easier for machine learning models to work with and understand complex structures in the data. Embeddings are commonly used in natural language processing (NLP), recommendation systems, and other applications where understanding the relationships between items is crucial.
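To make this concrete, the short sketch below (using the all-MiniLM-L6-v2 model discussed in the next section) scores two sentence pairs: a paraphrase pair should come out far more similar than an unrelated pair.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Each sentence becomes one dense vector; related meanings land close together.
    emb = model.encode([
        "How do I reset my password?",
        "I forgot my login credentials.",
        "The weather is nice today.",
    ], convert_to_tensor=True)

    print(util.cos_sim(emb[0], emb[1]))  # paraphrase pair: relatively high score
    print(util.cos_sim(emb[0], emb[2]))  # unrelated pair: much lower score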

Model Selection

The following Google Colab link demonstrates the use of SBERT's all-MiniLM-L6-v2 model. Its key specifications are noted below.
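A quick way to check these figures is to read them off the loaded model itself; the values in the comments are from the model card:

    from sentence_transformers import SentenceTransformer

    # Downloads the pretrained checkpoint from the Hugging Face hub on first use.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    print(model.get_sentence_embedding_dimension())  # 384-dimensional embeddings
    print(model.max_seq_length)                      # input truncated at 256 word pieces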

Similarity Search

Similarity search can be carried out broadly in three ways:

- Cosine similarity
- Dot product
- Euclidean distance

For this exercise, the model card suggests using cosine-similarity or dot-product as the score function. I chose cosine similarity to retrieve the top 5 results, i.e., a k-nearest-neighbour search with k = 5.
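A minimal sketch of that search, using the sentence-transformers semantic_search helper (which defaults to cosine similarity); the corpus sentences here are placeholders, not the notebook's actual data:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Placeholder corpus; in the notebook this would be the full document collection.
    corpus = [
        "A man is eating food.",
        "A man is riding a horse.",
        "A cheetah chases prey across a field.",
        "Two dogs play in the snow.",
        "A woman is playing the violin.",
        "The new movie was awesome.",
    ]
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

    query = "Someone is having a meal."
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Scores the query against every corpus embedding with cosine similarity
    # and returns the top_k nearest neighbours.
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
    for hit in hits:
        print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")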