So I read a bit about NDCG and how it measures the relevance of retrieved items in information retrieval/recommender systems. All the examples I read have specific values for item relevance and for the relevance the model predicts.
But how do recommender systems measure NDCG when relevance is unknown, i.e. when we only have implicit interactions from the user and can only assume similar relevance for those items? Also, the model output is a ranked list of items without a relevance score for each item; do we infer the model's predicted relevance as the inverse of the item's index in the recommendation list?
For example, if a user listens to songs [1,2,3] and gets songs [5,4,3,1] recommended to them, how do you calculate the NDCG score of the recommendation list?
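For context, a common convention with implicit feedback is to treat every item the user interacted with as having relevance 1 and everything else relevance 0, so no predicted relevance score is needed at all: DCG only uses the position of each item in the ranked list. Here is a minimal sketch of that convention for the example above (the binary-relevance assumption is a modelling choice, not part of any standard):

```python
import math

def ndcg(recommended, interacted, k=None):
    """NDCG with binary relevance inferred from implicit interactions.

    recommended: ranked list of item ids produced by the model
    interacted:  collection of item ids the user interacted with (relevance 1)
    """
    if k is None:
        k = len(recommended)
    relevant = set(interacted)

    # DCG: each hit contributes 1 / log2(rank + 2), where rank is 0-based
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )

    # IDCG: the best possible DCG, i.e. all relevant items ranked at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))

    return dcg / idcg if idcg > 0 else 0.0

# user listened to [1, 2, 3]; the model recommended [5, 4, 3, 1]
print(ndcg([5, 4, 3, 1], {1, 2, 3}))   # hits at ranks 3 and 4 -> ~0.44
```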
Related
How can we evaluate the ranking of results for an Information Retrieval system in an Unsupervised scenario?
One way to estimate the quality of the retrieved information in the absence of relevance assessments is Query Performance Prediction (QPP for short). There is a considerable volume of work on QPP in the IR literature, which you can dig up from the SIGIR/CIKM conferences.
Broadly speaking, QPP uses the idea that if the top-retrieved set of documents is significantly different from the collection as a whole, that is a reasonable indication that the top-retrieved set is focused on a specific topic, and hence likely to be relevant, because relevance itself is a property that is supposed to be focused on a particular topic (this is just an assumption, but it is the best we can do without assessments).
A simple technique to estimate how distinctive the top-k documents are is to check the skewness of their retrieval scores: the more skewed they are, the higher the likelihood that the top-k differs from the rest of the collection (and hence that the retrieval is good).
The figure below (taken from this TOIS paper) shows how standard deviation can be used as a measure of (inverse) skewness. The standard deviation of the left distribution is smaller (the scores stay close to the average), so it is an example of a query for which the system hasn't been able to retrieve useful documents.
In contrast to the standard usage of QPP, which compares two queries, in your case the query is fixed and you would basically compare across retrieval models (e.g. the score distribution with tf-idf could be less skewed than with BM25).
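To make that comparison concrete, here is a rough sketch of using the spread (standard deviation) of the top-k scores as the predictor, along the lines described above; the score lists and the cut-off k are purely illustrative:

```python
import statistics

def qpp_spread(scores, k=50):
    """Standard deviation of the top-k retrieval scores for one query.

    A larger spread suggests the top-k scores stand out from the bulk of the
    collection, which QPP takes as a hint of a focused (and hence good) result set.
    """
    top_k = sorted(scores, reverse=True)[:k]
    return statistics.stdev(top_k)

# hypothetical score lists returned by two retrieval models for the same query
tfidf_scores = [12.1, 11.9, 11.8, 11.7, 11.6, 11.5, 11.4, 11.3]
bm25_scores  = [25.3, 19.2, 14.1, 11.0, 10.4, 10.2, 10.1, 10.0]

# the model whose top scores are more spread out is predicted to have retrieved better
print(qpp_spread(tfidf_scores, k=8), qpp_spread(bm25_scores, k=8))
```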
I wonder whether the relevance score in Elasticsearch differs from the one in Couchbase or not.
As per this 2019 Couchbase thread, it looks like Couchbase is still using TF/IDF for scoring, while Elasticsearch used to use the same algorithm but has moved to BM25 for score calculation as of version 5.0.
Note: TF/IDF is a very popular algorithm for calculating the relevance score, based on term frequency and inverse document frequency, while BM25 is a newer, improved form based on probabilistic scoring. More details about them can be found here and here.
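To illustrate the difference, here is a rough, simplified sketch of the two scoring ideas; real Lucene/Elasticsearch scoring adds boosts and normalizations on top of this, and the k1 and b values below are the common BM25 defaults rather than settings taken from either system:

```python
import math

def tfidf_score(tf, df, n_docs):
    """Classic TF/IDF: the score keeps growing with term frequency."""
    idf = math.log(n_docs / (1 + df))
    return tf * idf

def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25: probabilistic IDF plus a saturating, length-normalized TF."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# with BM25, repeating a term 100 times scores barely more than repeating it 10 times;
# with plain TF/IDF the contribution keeps growing linearly
for tf in (1, 10, 100):
    print(tf,
          round(tfidf_score(tf, df=20, n_docs=1000), 2),
          round(bm25_score(tf, df=20, n_docs=1000, doc_len=120, avg_doc_len=100), 2))
```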
Note: The question doesn't mention for what purpose you are comparing the relevance of the two systems. My two cents: if you are building a full-blown search system and relevance matters to you, then you should choose Elasticsearch, whose primary function is search and which offers a lot of flexibility in choosing different algorithms and different ways to define the scoring mechanism, flexibility which is not present in a NoSQL solution like Couchbase.
I need to build a search engine using Elasticsearch and the steps will be as follows:
Search on the search engine with a search string.
The relevant results will be displayed and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the document and will have an option "More Like This" (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: besides returning documents similar to the selected one, how can I also return the percentage to which each document is similar to the selected one?
There are a couple of things you can do.
Using a function_score query
A more_like_this query is essentially a full-text search, and it returns documents ordered by their relevance score. It could be possible to convert the score directly to a percentage, but it is not advised (see here, and more specifically here).
Instead, one can define a custom score with the help of a function_score query, which can be designed so it returns a meaningful percentage.
This, of course, comes with the additional cost of complexity, and the definition of "similarity" becomes more of an art than a science.
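As a sketch of what that might look like, the request body below wraps a more_like_this query in a function_score whose script squashes the raw score into a 0..1 range. The field names, the `_score / (_score + pivot)` saturation, and the pivot value of 10 are illustrative choices, not a recommended formula:

```python
# Hypothetical search body: wrap more_like_this in a function_score and map the
# raw relevance score onto 0..1 with a simple saturation function.
search_body = {
    "query": {
        "function_score": {
            "query": {
                "more_like_this": {
                    "fields": ["title", "description"],   # illustrative field names
                    "like": [{"_index": "documents", "_id": "THE_SELECTED_DOC_ID"}],
                    "min_term_freq": 1,
                }
            },
            "script_score": {
                # score / (score + 10): approaches 1 for very strong matches
                "script": {"source": "_score / (_score + 10)"}
            },
            "boost_mode": "replace",   # use the script result instead of multiplying it in
        }
    }
}
```

Multiplying the returned score by 100 on the application side then gives a rough "percentage", with all the caveats about score interpretation mentioned above.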
Using dense_vector
One may opt to use the (still experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: the distance between two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be computed somehow, and the quality of those vectors will determine the quality of the similarity itself.
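Since cosine similarity is bounded, it maps naturally onto a percentage. A minimal sketch, with the embedding step deliberately left out because it depends entirely on the model that produces the vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_percentage(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 100]."""
    return (cosine_similarity(a, b) + 1) / 2 * 100

# these vectors would come from whatever embedding model indexed the documents
selected_doc = [0.12, 0.80, 0.33]
candidate    = [0.10, 0.75, 0.40]
print(f"{similarity_percentage(selected_doc, candidate):.1f}% similar")
```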
The bottom line is that to make this work with Elasticsearch, a fair amount of computation and logic has to be added outside of it, either in the form of pre-computed models or custom curated scoring algorithms. Out of the box, Elasticsearch does not seem well suited for percentage-style similarity.
Hope that helps!
If you're going the route of semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box system for improving Elasticsearch results with SOTA models.
I'm new to Elasticsearch. I'm having trouble understanding the calibration and scaling of boost values for fields in a document, i.e. how should we decide the boost values for each field so that it works as expected? I've gone through some of the online blogs and the ES docs as well; they say that ES does normalization and internal optimization of boost values. How does that work?
E.g., if we have tags, title, name, and text fields in our doc, how should we decide the boost values for these?
Elasticsearch uses a boolean model to match documents, and then a scoring model to determine relevance (i.e. ranking). The scoring model utilizes a TF/IDF score, coupled with some additional features. Those TF/IDF scores are calculated for each matching field within a query, and then aggregated to produce an overall score for a document. To dig into this process, I suggest running explain on your query to see how the score of each field is influencing the overall relevance of your document.
As the expert on your data, you're in the best position to determine which fields should most heavily influence the relevance of your document. Finding the right boost value for a field is about adjusting the levers until you find a formula that best suits your desired outcome (also, if you have users, A/B testing can help here).
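For illustration, per-field boosts are commonly expressed with the field^boost syntax in a multi_match query; the field names and boost values below are only a starting point to iterate on, ideally while checking the explain output or running A/B tests as suggested above:

```python
# Hypothetical search body: boost title and tags more heavily than name and text.
# The ^N suffix multiplies that field's contribution to the overall score.
search_body = {
    "explain": True,   # ask Elasticsearch to break down how each field influenced the score
    "query": {
        "multi_match": {
            "query": "user search string",
            "fields": ["title^3", "tags^2", "name^1.5", "text"],
        }
    },
}
```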
I have a list of keywords taken from 95 documents. I'd like to rank their importance, but I only have the number of documents in which each keyword appears and the maximum frequency of each keyword among all the documents. I'm looking for a ranking formula that could help. At the moment I'm using IDF, but I'd like to know if there is any better formula.
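Given only those two quantities, one option is to keep IDF but damp it with the maximum frequency as a crude TF-like component. Here is a sketch of both the plain IDF and one such combination; the combination itself is just an illustrative heuristic, not a standard formula:

```python
import math

N_DOCS = 95  # total number of documents in the collection

def idf(doc_freq, n_docs=N_DOCS):
    """Inverse document frequency: rarer keywords score higher."""
    return math.log(n_docs / doc_freq)

def damped_tf_idf(max_freq, doc_freq, n_docs=N_DOCS):
    """Heuristic: use the keyword's maximum per-document frequency as a TF proxy.

    The log damping keeps a keyword that appears 50 times in one document from
    completely dominating one that appears 5 times.
    """
    return (1 + math.log(max_freq)) * idf(doc_freq, n_docs)

# a keyword appearing in 3 of the 95 documents, at most 7 times in any one of them
print(round(idf(3), 3), round(damped_tf_idf(7, 3), 3))
```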
Work on word frequency has already been done in the form of lists of the most common words in English (and many other languages); see the Wiktionary frequency lists, which include many types of lists based on the most important and top words, besides the most frequent words in TV and movies, and many others.
If you'd like to build an algorithm based on word ranking, I would suggest you don't stray far from TF-IDF, and here you can find the Latent Semantic Indexing algorithm, which might be an asset for you.
Hope that is what you needed.
TF-IDF is definitely a good base and easy to implement.
It is also really common to add other biases, such as the position of your terms inside your documents; a term occurring at the beginning of a document, or better, in its title, tends to be more relevant than ones occurring in the middle or at the end.
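A rough sketch of how such a position bias could be folded into the term-frequency part of a TF-IDF score; the linear decay and the title multiplier are arbitrary illustrations of the idea, not established constants:

```python
def position_weight(position, doc_length, in_title=False):
    """Weight one occurrence of a term by where it appears in the document.

    position:   0-based index of the occurrence in the document body
    doc_length: total number of terms in the document
    """
    if in_title:
        return 2.0                       # title occurrences count double (illustrative)
    # linear decay from 1.0 at the start of the document down to 0.5 at the end
    return 1.0 - 0.5 * (position / max(doc_length - 1, 1))

def positional_tf(occurrence_positions, doc_length, title_hits=0):
    """Position-weighted term frequency to plug into a TF-IDF style score."""
    body = sum(position_weight(p, doc_length) for p in occurrence_positions)
    return body + title_hits * position_weight(0, doc_length, in_title=True)

# a term in the title and near the start outweighs one buried at the end
print(positional_tf([3, 950], doc_length=1000, title_hits=1))
```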
But you have to keep in mind that choosing an algorithm and its biases also depends on the nature of your documents. For instance, long documents (e.g. research papers or books) would need a position bias, but news articles not necessarily. The same goes for the IDF measure: it has to be computed on a large corpus of documents with a similar type of content to your documents. You don't want a relevance score computed on a "TV and Movies" corpus if, for instance, your documents are research papers about semiconductors.
My two cents.