The full Colab notebook is available on GitHub.
Evaluating information retrieval (IR) systems is essential for informed design decisions and understanding system performance. Successful companies like Amazon and Google rely heavily on IR systems, with Amazon attributing over 35% of sales and Google 70% of YouTube views to their recommender systems. Effective evaluation measures are key to building and refining these systems.
Here, we will use Normalized Discounted Cumulative Gain (NDCG@K) to evaluate the performance of both the base model and the fine-tuned model, assessing whether the fine-tuned model outperforms the base model.
Redis will serve as a persistent vector store for the embeddings, with RedisVL as the Python client library.
Normalized Discounted Cumulative Gain (NDCG) evaluates retrieval quality by assigning ground-truth relevance scores to database elements. For example, highly relevant results might score 5, partially relevant ones 2–4, and irrelevant ones 1. NDCG sums the relevance scores of the retrieved items but applies a log-based discount that depends on result order: relevant items that appear lower in the list contribute less, so the system is rewarded for placing relevant items earlier in the results.
NDCG improves upon Cumulative Gain (CG) by accounting for the importance of rank positions, as users prioritize top results. It uses a discount factor to give higher weight to top-ranked items.
To calculate NDCG@K, first compute the Discounted Cumulative Gain over the top K results:

DCG@K = Σ (i = 1 to K) rel_i / log2(i + 1)

where:
– rel_i: Relevance score of the document at position i.
– i: Rank position of the document (1-based index).
– log2(i + 1): Discount factor that reduces the impact of lower-ranked items.

NDCG@K is then obtained by dividing DCG@K by the ideal DCG@K (the DCG of the best possible ordering of the same items), so scores always fall between 0 and 1.
Relevance scores quantify how useful a retrieved item is for a given query. They can be binary (relevant or not) or graded according to the importance of the result:
Graded relevance captures varying degrees of relevance, offering richer feedback for evaluating rankings. It reflects that some results may be useful but not perfect, while others are completely irrelevant.
Let’s get back to our steps to calculate NDCG.
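The steps above can be sketched in a few lines of Python. This is an illustrative implementation of DCG@K and NDCG@K, not the notebook's exact code:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain over the top-k relevance scores."""
    return sum(rel / math.log2(i + 2)  # i is 0-based, so position = i + 1
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """Normalize DCG@K by the ideal DCG@K (scores sorted best-first)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Retrieved results with graded relevance (5 = highly relevant, 1 = irrelevant).
# The second result is the best one, so the ranking is penalized slightly.
scores = [3, 5, 2, 1, 1]
print(round(ndcg_at_k(scores, 5), 3))  # → 0.915
```

Note that a perfect ordering ([5, 3, 2, 1, 1]) would score exactly 1.0, since DCG@K would equal the ideal DCG@K.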
Limitations of NDCG:
– It requires ground-truth relevance judgments, which can be expensive to collect.
– It says nothing about the diversity or novelty of the results, only their relevance.
– Scores depend on the cutoff K, so comparisons are only meaningful at the same K.
The objective is to compute embeddings for all answers using both the base and fine-tuned embedding models and store them in two separate vector indexes. These embeddings will be used to retrieve answers based on the corresponding questions. After retrieval, the NDCG@K score will be calculated for both sets of embeddings to evaluate their performance.
Download the BGE base model (BAAI/bge-base-en-v1.5) from Hugging Face.
The following code defines a schema for an index with two fields: a “tag” field (qa) and a “vector” field (embedding) configured for high-dimensional data search with specified attributes like algorithm, data type, dimensionality, and distance metric.
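A sketch of what such a schema might look like in RedisVL's dictionary format. The index name, prefix, and vector attributes here are illustrative assumptions, except the 768 dimensions, which match the BGE base model's output size:

```python
# Hypothetical RedisVL-style index schema; names and attrs are illustrative
schema_dict = {
    "index": {
        "name": "qa_index",   # assumed index name
        "prefix": "qa",       # key prefix for stored records
    },
    "fields": [
        # Tag field linking each record back to its QA pair
        {"name": "qa", "type": "tag"},
        # Vector field holding the answer embeddings
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "algorithm": "flat",         # exact search; "hnsw" is the approximate option
                "datatype": "float32",
                "dims": 768,                 # BGE base model embedding dimensionality
                "distance_metric": "cosine",
            },
        },
    ],
}
```

With RedisVL installed and a Redis server running, this dictionary could be handed to `SearchIndex.from_dict(schema_dict)` to build the index object.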
The following script creates a Redis index, processes QA pairs by generating vector embeddings for the answers, and loads the data into the index for search and retrieval.
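A sketch of how the answers might be embedded and shaped into records for loading. The QA pairs and embedding function below are stand-ins (a real run would call the model's `encode`), and the RedisVL calls are shown as comments because they require a live Redis server:

```python
import numpy as np

# Hypothetical QA pairs (illustrative only)
qa_pairs = [
    {"question": "What is Redis?", "answer": "Redis is an in-memory data store."},
]

def embed(text, dims=768):
    """Stand-in for model.encode(text): returns a deterministic float32 vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(dims, dtype=np.float32)

records = []
for i, pair in enumerate(qa_pairs):
    vec = embed(pair["answer"])
    records.append({
        "qa": str(i),                # tag field linking question and answer
        "embedding": vec.tobytes(),  # RediSearch expects raw float32 bytes
    })

# With RedisVL and a running Redis server, the records would then be loaded:
#   index = SearchIndex.from_dict(schema_dict)
#   index.create(overwrite=True)
#   index.load(records)
```

The key detail is the `tobytes()` conversion: vectors are stored as raw little-endian float32 buffers, so a 768-dimensional embedding occupies 3072 bytes per record.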
The following script performs vector searches for each question in the QA dataset, calculates metrics like NDCG scores and match rankings, and tracks how often the correct answer is found at each position in the results. The final output includes key metrics for evaluating the model’s performance.
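The metric side of that script can be sketched simply. When exactly one answer is relevant per question (binary relevance), the ideal DCG is 1, so NDCG@K for a single query reduces to 1 / log2(pos + 1), where pos is the position of the correct answer. The match positions below are made up for illustration:

```python
import math
from collections import Counter

def ndcg_single(match_pos, k=10):
    """NDCG@K when exactly one result is relevant.

    With binary relevance, IDCG = 1 / log2(2) = 1,
    so NDCG collapses to 1 / log2(pos + 1)."""
    if match_pos is None or match_pos > k:
        return 0.0
    return 1.0 / math.log2(match_pos + 1)

# Hypothetical 1-based match positions per question; None = not in top-K
positions = [1, 1, 3, None, 2]

# Tally how often the correct answer lands at each rank
rank_counts = Counter(p for p in positions if p is not None)
avg_ndcg = sum(ndcg_single(p) for p in positions) / len(positions)

print(dict(rank_counts))   # → {1: 2, 3: 1, 2: 1}
print(round(avg_ndcg, 2))  # → 0.63
```

Averaging these per-query scores over the full dataset yields the single NDCG figure reported for each model below.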
Here’s a summary of the results:
Rank 1: 132 matches
Rank 2: 8 matches
Rank 3: 5 matches
Rank 4: 8 matches
Rank 5: 3 matches
Ranks 6-10: 0 matches each
This indicates that the model performs relatively well, with the correct answer being frequently ranked highly (e.g., 132 times in the top position), but the performance drops significantly in the lower ranks. The average NDCG score is 0.49, suggesting room for improvement.
Download the BGE fine-tuned model (rezarahim/bge-finetuned-detail) from Hugging Face.
Details of this portion of the code can be found in the notebook.
Here’s a summary of the fine-tuned model results:
Rank 1: 166 matches
Rank 2: 8 matches
Rank 3: 2 matches
Rank 4: 2 matches
Ranks 5-10: 0 matches each
This marks a clear improvement over previous results. The fine-tuned model frequently places the correct answer in the top rank (166 times) and achieves an average NDCG score of 0.60, showing better ranking performance and overall accuracy.