ElasticSearch scoring / number of docs

I have small (max 50 chars) keywords stored in a text field in an Elasticsearch index. I noticed that if I clear the index and add only one document, say "samsung galaxy", the score when I match the document is around 0.95.
But when I add 500k other docs and make the same query, the score is around 20. I would like to set a min_score for this query because I need a certain level of relevancy.
But since the score depends on the doc count, I can't set a min_score, as the number of docs in the index will constantly evolve.
I already looked for solutions like constant_score but I need the power of Elastic to give me a score (and not 1 or 0).
1) Does this behavior come from the IDF method, or not only from it?
2) Is there a way to keep the current search algorithm (or just drop the term frequency) and always get the same score for a query, independent of the doc count? This would allow me to set a min_score.
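To illustrate what I suspect: the classic Lucene IDF formula, idf(t) = 1 + ln(numDocs / (docFreq + 1)), grows with the total doc count. A quick sketch of the numbers, assuming my keyword appears in exactly one document:

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    # Classic Lucene IDF: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

print(idf(1, 1))        # index holds only my one document: ~0.31
print(idf(500_001, 1))  # after adding 500k other docs: ~13.43
```

So the same term in the same document gets a much higher IDF contribution just because unrelated documents were added.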


How to boost most popular (top quartile) in elasticsearch query results (outliers)

I have an elasticsearch query that includes bool - must / should sections that I have refined to match search terms and boost for terms in priority fields, phrase match, etc.
I would like to boost documents that are the most popular. The documents include a field "popularity" that indicates the number of times the document was viewed.
Preferably, I would like to boost any documents in the result set that are outliers - meaning that the popularity score is perhaps 2 standard deviations from the average in the result set.
I see aggregations but I'm interested in boosting results in a query, not a report/dashboard.
I also noted the new rank_feature query in ES 7 (I am still on 6.8 but could upgrade). It seems that the rank_feature query operates across all documents, not just the result set.
Is there a way to do this?
I think that you want to use a rank or a range query in a "rescore query".
If your need is too specific for classical queries, you can use a "function_score" query in your rescore and use a script to write your own score calculation.
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/filter-search-results.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-rescore.html
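For example, a rescore using a function_score with a script over the popularity field might look like this (6.8-style syntax; the match field and weights are illustrative):

```json
{
  "query": {
    "match": { "title": "my search terms" }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "Math.log(2 + doc['popularity'].value)"
            }
          }
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 1.5
    }
  }
}
```

Note that the rescore only re-ranks the top window_size hits per shard, so the popularity boost is applied within the result set rather than across the whole index.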

IDF recalculation for existing documents in index?

I have gone through [Theory behind relevance scoring][1] and have two related questions.
Q1 :- As the IDF formula is idf(t) = 1 + log ( numDocs / (docFreq + 1)), where numDocs is the total number of documents in the index, does it mean that each time a new document is added to the index, we need to re-calculate the IDF for each word for all existing documents in the index?
Q2 :- The link mentions the statement below. My question is: is there any reason why the TF/IDF score is calculated against each field instead of the complete document?
When we refer to documents in the preceding formulae, we are actually talking about a field within a document. Each field has its own inverted index and thus, for TF/IDF purposes, the value of the field is the value of the document.
You only calculate the score at query time and not at insert time. Lucene has the right statistics to make this a fast calculation and the values are always fresh.
The frequency only really makes sense against a single field since you are interested in the values for that specific field. Assume we have multiple fields and we search a single one, then we're only interested in the frequency of that one. Searching multiple ones you still want control over the individual fields (such as boosting "title" over "body") or want to define how to combine them. If you have a use-case where this doesn't make sense (not sure I have a good example right now — it's IMO far less common) then you could combine multiple fields into one with copy_to and search on that.
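A minimal copy_to mapping sketch (7.x-style syntax; the field names are made up):

```json
{
  "mappings": {
    "properties": {
      "title":    { "type": "text", "copy_to": "combined" },
      "body":     { "type": "text", "copy_to": "combined" },
      "combined": { "type": "text" }
    }
  }
}
```

Searching the combined field then computes term statistics over the merged content rather than per original field.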

How do I compute facets/aggregations for the top n documents, with pagination in Elasticsearch?

Suppose I have an index for cars on a dealer's car lot. Each document resembles the following:
```json
{
  "color": "red",
  "model_year": "2015",
  "date_added": "2015-07-20"
}
```
Suppose I have a million cars.
Suppose I want to present a view of the most recently added 1000 cars, along with facets over those 1000 cars.
I could just use from and size to paginate the results up to a fixed limit of 1000, but in doing so the totals and facets on model_year and color (i.e. aggregations) that I get back from Elasticsearch aren't right -- they're computed over the entire matched set.
How do I limit my search to the most recently added 1000 documents for pagination and aggregation?
As you probably saw in the documentation, the aggregations are performed on the scope of the query itself. If no query is given, the aggregations are performed on a match_all list of results. Even if you would use size at the query level, it will still not give you what you need because size is just a way of returning a set of documents from all the documents the query matched. Aggregations operate on what the query matches.
This feature request is not new and has been asked for before.
In 1.7 there is no straightforward solution. Maybe you can use the limit filter or the terminate_after in-body request parameter, but these will not return documents that have also been sorted: they give you the first terminate_after docs that matched the query, and this number is per shard. It is not applied after the sorting.
In ES 2.0 there is also the sampler aggregation, which works more or less the same way as terminate_after, but it takes into consideration the score of the documents collected from each shard. If you just sort on date_added and the query is a match_all, all the documents will have the same score and it will return an irrelevant set of documents.
In conclusion:
there is no good solution for this; there are workarounds based on the number of docs per shard. So, if you want 1000 cars, you need to take this number, divide it by the number of primary shards, and use it in a sampler aggregation or with terminate_after to get a set of documents
my suggestion is to use a query that limits the number of documents (cars) by a different criterion instead. For example, show (and aggregate) the cars added in the last 30 days or something similar. In other words, the criterion should be included in the query itself, so that the resulting set of documents is the one you want aggregated. Applying aggregations to a certain number of documents, after they have been sorted, is not easy.
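A sketch of that suggested approach over the example car documents (assuming date_added is a date field and that color and model_year are aggregatable, e.g. not_analyzed/keyword fields):

```json
{
  "query": {
    "range": {
      "date_added": { "gte": "now-30d/d" }
    }
  },
  "aggs": {
    "by_color":      { "terms": { "field": "color" } },
    "by_model_year": { "terms": { "field": "model_year" } }
  },
  "size": 50
}
```

Here the aggregations run over exactly the matched (recent) cars, and size only controls how many hits come back per page.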

Different queryNorm values in the result for the same query

The following query https://gist.github.com/anonymous/be27203a578494566a35 gives the following result set https://gist.github.com/anonymous/6935100dbf76b9a8f3e3. The documents have been indexed with these settings https://gist.github.com/anonymous/ca42a7f67c7281935950.
As you can see, the queryNorm value for the documents in the result set varies. But according to the documentation (taken from http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/practical-scoring-function.html#query-norm):
The same query normalization factor is applied to every document and you have no way of changing it. For all intents and purposes, it can be ignored.
Unfortunately, since this does not seem to be true (or maybe I have misunderstood something), I do not get the desired result set for the query above. More specifically, I would expect the second document to have higher relevance than the first, since there is a higher boosting factor when the query matches the "name" field compared to the "subtype" field. But because the queryNorm factor is lower for the second document, its total relevance score ends up lower.
Why does the queryNorm behave this way?
Is there really no way of disabling it? (i.e. setting the factor to 1)
I am running version 1.4.0 of Elasticsearch.

Limiting aggregation to the top X hits in elasticsearch

ElasticSearch builds the aggregation results based on all the hits of the query, independently of the from and size parameters. This is what we want in most cases, but I have a particular case in which I need to limit the aggregation to the top N hits. The limit filter is not suitable, as it does not fetch the best N items but only the first N matching the query (per shard), independently of their score.
Is there any way to build a query whose hit count has an upper limit N in order to be able to build an aggregation limited to those top N results? And if so how?
Subsidiary question: Limiting the score of matching documents could be an alternative even though in my case I would require a fixed bound. Does the min_score parameter affect aggregation?
You are looking for Sampler Aggregation.
I have a similar answer explained here
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
If you are using an ElasticSearch cluster with version > 1.3, you can use the top_hits aggregation by nesting it in your aggregation, ordering on the field you want, and setting the size parameter to X.
The related documentation can be found here.
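A sketch of that nesting (the bucket field is illustrative):

```json
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "top_docs": {
          "top_hits": {
            "sort": [ { "_score": { "order": "desc" } } ],
            "size": 3
          }
        }
      }
    }
  }
}
```

Each bucket then carries only its top 3 hits by score, although the bucket counts themselves are still computed over all matching documents.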
I need to limit the aggregation to the top N hits
With nested aggregations, your top bucket can represent those N hits, with nested aggregations operating on that bucket. I would try a filter aggregation for the top level aggregation.
The tricky part is to make use of the _score somehow in the filter and to limit it exactly to N entries... There is a limit filter that works per shard, but I don't think it would work in this context.
It looks like Sampler Aggregation can now be used for this purpose. Note that it is only available as of Elastic 2.0.
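A sampler sketch (note that shard_size is per shard, so the total sample is roughly shard_size times the number of shards; the field names are illustrative):

```json
{
  "size": 0,
  "query": { "match": { "text": "elasticsearch" } },
  "aggs": {
    "best_hits_sample": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "by_tag": { "terms": { "field": "tags" } }
      }
    }
  }
}
```

The nested terms aggregation then only sees the best-scoring sampled documents from each shard instead of every match.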
