Text Semantic Search
Semantic Search
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also retrieve documents that express the same meaning with different words, such as synonyms.
The idea behind semantic search is to embed all entries in your corpus, whether sentences, paragraphs, or documents, into a vector space.
The query is embedded into the same vector space at search time, and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
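A minimal sketch of this pipeline using the sentence-transformers library (the corpus sentences and query below are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder corpus and query, purely for illustration.
corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "The new movie is so great.",
]
query = "A person is having a meal."

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed every corpus entry and the query into the same vector space.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Find the corpus entries whose embeddings are closest to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```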
Embeddings
In machine learning, an embedding is a representation of objects (such as words, items, or entities) in a lower-dimensional vector space. Embeddings are used to capture the inherent relationships and similarities between objects, making it easier for machine learning models to work with and understand complex structures in the data. Embeddings are commonly used in natural language processing (NLP), recommendation systems, and other applications where understanding the relationships between items is crucial.
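As a toy illustration of how embeddings capture similarity (the three-dimensional vectors below are hand-crafted, not produced by any real model; learned embeddings typically have hundreds of dimensions), related concepts end up closer together in the vector space:

```python
import numpy as np

# Hand-crafted toy embeddings, purely for illustration.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.85, 0.75, 0.2]),
    "car": np.array([0.1, 0.2, 0.95]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["dog"]))  # close to 1: related concepts
print(cosine(embeddings["cat"], embeddings["car"]))  # noticeably lower: unrelated
```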
Model Selection
The following Google Colab link demonstrates the use of SBERT's all-MiniLM-L6-v2 model. Its specifications are below:
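A minimal sketch of loading the model (all-MiniLM-L6-v2 maps sentences to 384-dimensional vectors and truncates input at 256 word pieces by default):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["This framework generates embeddings for each input sentence."]
embeddings = model.encode(sentences)

print(embeddings.shape)                          # (1, 384)
print(model.get_sentence_embedding_dimension())  # 384
print(model.max_seq_length)                      # 256
```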
Similarity Search
Similarity search can be carried out in three broad ways, as the sketch after this list illustrates:
Cosine Similarity:
Measures the cosine of the angle between two vectors, ignoring their magnitudes; it is especially prevalent in natural language processing. It ranges from -1 (completely opposite) to 1 (identical).
Euclidean Distance:
Measures the straight-line distance between two points in space. Lower distance implies higher similarity.
Dot Product Similarity:
Based on the dot product of the two vectors. For unit-length (normalised) vectors it ranges from -1 to 1: 1 for identical vectors, -1 for diametrically opposed vectors, and 0 for orthogonal vectors. For unnormalised vectors it is unbounded and also reflects vector magnitude.
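A minimal sketch of the three metrics for a pair of vectors (the example vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

# Cosine similarity: cosine of the angle between the vectors, in [-1, 1].
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance; smaller means more similar.
euclidean_dist = np.linalg.norm(a - b)

# Dot product: unbounded in general; equals cosine similarity when both
# vectors are normalised to unit length.
dot = np.dot(a, b)

print(cosine_sim, euclidean_dist, dot)
```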
For this exercise, the model's documentation suggests using cosine similarity or dot product. I chose cosine similarity to retrieve the top 5 results (a k-nearest-neighbour search with k = 5).
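A minimal sketch of that top-5 retrieval, computing cosine similarity explicitly and keeping the k = 5 nearest neighbours (the corpus and query are placeholders):

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Placeholder corpus; in practice this is the full document collection.
corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "The new movie is so great.",
    "A cheetah chases its prey across a field.",
]
query = "Someone is having a meal."

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus entry,
# then keep the k nearest neighbours.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
top_k = torch.topk(scores, k=min(5, len(corpus)))

for score, idx in zip(top_k.values, top_k.indices):
    print(f"{corpus[idx]} (score: {score:.4f})")
```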