What do norms actually store in Elasticsearch?

I came across a mapping where norms are disabled on some fields that use a custom analyzer.
Then I read about norms in the official doc at https://www.elastic.co/guide/en/elasticsearch/reference/current/norms.html, but it doesn't explain clearly what exactly norms store and how they are actually useful in scoring.
Below is the snippet from the above link:
Norms store various normalization factors that are later used at query
time in order to compute the score of a document relatively to a
query.
I found some other docs that gave more information and advised to disable norms for analyzed fields where scoring is not needed, describing norms as numbers representing the relative field length and the index-time boost setting. But I am still unable to understand it completely.
So, in short, I have the following doubts:
What exactly do norms store?
What exactly is relative field length, and how is it useful for scoring?
What is the default value of norms?
Can I see the content of norms using some ES query?

Here is my attempt at an answer :)
What exactly do norms store, and what is relative field length and how is it useful for scoring?
Norms store information that lets Elasticsearch know the relative field length. Why does that matter?
How long is the field? The shorter the field, the higher the weight.
If a term appears in a short field, such as a title field, it is more
likely that the content of that field is about the term than if the
same term appears in a much bigger body field
Default value of norms?
Norms are enabled on text fields and disabled on all other field types.
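If you do not need scoring on a field, norms can be turned off in the mapping when the index is created. A minimal sketch (the index and field names are only illustrative):
PUT /my-index
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "norms": false
      }
    }
  }
}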
Can I see the content of norms using some ES query?
No, norms are stored in the segments' data, so you cannot read them directly. But you can see their impact if you use the explain flag in your request. Somewhere in the score explanation you will see something like this:
{
  "value": 1.4506965,
  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
  "details": [
    {
      "value": 3,
      "description": "termFreq=3.0",
      "details": []
    },
    {
      "value": 1.2,
      "description": "parameter k1",
      "details": []
    },
    {
      "value": 0.75,
      "description": "parameter b",
      "details": []
    },
    {
      "value": 34.572754,
      "description": "avgFieldLength",
      "details": []
    },
    {
      "value": 48,
      "description": "fieldLength",
      "details": []
    }
  ]
}
where fieldLength and avgFieldLength are computed from the norms data.
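To get such an explanation yourself, you can set the explain flag on a search request. A minimal sketch (the index, field, and query text are only illustrative):
GET /my-index/_search
{
  "explain": true,
  "query": {
    "match": { "description": "laptop" }
  }
}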
This answer is primarily based on https://www.elastic.co/fr/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables and https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm

Related

Elasticsearch date based function scoring boosting the wrong way

I would like to boost scores of documents based on how "recent" a document is. I am trying to do this using a function_score. Here is an example of me doing this on a field called updated_at:
{
  "function_score": {
    "boost_mode": "sum",
    "functions": [
      {
        "exp": {
          "updated_at": {
            "origin": "now",
            "scale": "1h",
            "decay": 0.01
          }
        },
        "weight": 1
      }
    ],
    "query": query
  }
}
I would expect documents close to now to have a score closer to 1, and documents about scale away to have a score closer to decay (as described in the docs). Therefore I'm using the boost_mode sum, to keep the original document scores and increase them depending on how close the updated_at value is to now. (Also, the query score is useful, so I would rather add than multiply, which is the default.)
To test this scenario, I create a document (A) that returns a query score of about 2. I then duplicate it (B) and modify the new document's updated_at timestamp to be an hour in the past.
In this scenario, I would expect (A) to have a higher score and (B) to have a lower score. However, when I run this scenario, I get the exact opposite. (B) ends up with a score of 3 and (A) ends up with a score of 2.
What am I misunderstanding here to cause this to happen? And how would I modify my function score to do what I would like?
This turned out to be a timezone issue.
I ended up using the explain API to look at what was contributing to the score. When doing that, I noticed that the origin set to now was actually in a different timezone to the one I was setting in the documents.
I fixed this by manually providing a UTC timestamp in the elasticsearch query rather than using now as the value.
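For example, instead of "origin": "now", the decay function can be given an explicit UTC timestamp (the value below is just a placeholder):
"exp": {
  "updated_at": {
    "origin": "2020-01-01T12:00:00Z",
    "scale": "1h",
    "decay": 0.01
  }
}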
(If there is a better way to do this, please let me know)

Finding all words and their frequencies in an elasticsearch index

Elasticsearch Newbie here. I have an elasticsearch cluster and an index http://localhost:9200/products and each product looks like this:
{
  "name": "laptop",
  "description": "Intel Laptop with 16 GB RAM",
  "title": "...."
}
I wanted all keywords in a field and their frequencies across all documents in the index. For example:
description: intel -> 2500, laptop -> 40000, etc. I looked at the term vectors API (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html), but that only lets me do it for a single document. I want it across all documents in a particular field.
I wrote a plugin for this, but it is an expensive call (depending on how many terms you want to get and the cardinality of the terms): https://github.com/nirmalc/es-termstat
Currently, there is no way to get term vectors for all documents in an index at once. You can use the term vectors API for a single document's term frequencies, or the multi term vectors API for several documents at a time. A possible workaround is to make a scan request to fetch all documents of a given type, and for each page build a multi term vectors request like the one below:
POST /products/_mtermvectors
{
  "ids": ["1", "2"],
  "parameters": {
    "fields": [
      "description"
    ],
    "term_statistics": true
  }
}
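The scan request mentioned above corresponds to the scroll API in recent Elasticsearch versions. A minimal sketch of the paging step (the scroll duration and page size are arbitrary choices), whose hit IDs can then be fed into the _mtermvectors call above:
POST /products/_search?scroll=1m
{
  "size": 100,
  "_source": false,
  "query": { "match_all": {} }
}
Subsequent pages are fetched by posting the returned _scroll_id to /_search/scroll.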

Elasticsearch: how to know which field the results are sorted by?

In Elasticsearch, is there any way to check which field the results are sorted by? I want something like inner-hits for sort clause.
Imagine that your documents have this kind of form:
{"numerals" : [ // nested
{"key": "point", "value": 30},
{"key": "points", "value": 200},
{"key": "score", "value": 20},
{"key": "scores", "value": 40}
]
}
and you sort the results by:
{"numerals.value": {
"nested_path": "numerals",
"nested_filter": {
"match": {
"numerals.key": "score"}}}}
Now I have no idea how to know the field by which the results are actually sorted: it is probably scores for this document, but perhaps score for others. There are two problems: 1. you cannot use inner hits or highlighting on the nested fields, and 2. even if you could, it would not solve the issue when there are multiple matching candidates.
The question is about sorting by fields that are inside nested objects.
This is what the documentation
(https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-sorting.html
and
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_nested_sorting_example)
says:
Elasticsearch will first restrict the nested documents with the "nested_filter" query and then sort in the same way as for multi-valued fields:
exactly as if the filtered nested documents were the only inner objects, i.e., as if the root document had a multi-valued field containing exactly the values that belong to the filtered nested objects
(in your example only one value would remain: 20).
If you want to be sure about the sort order, insert a "mode" parameter: "min", "max", "sum", "avg" or "median".
If you do not specify the "mode" parameter, then according to the corresponding issue the min value will be picked for "asc" and the max value for "desc" order:
By default when sorting on a multi-valued field the lowest or highest
value will be picked from the field values depending on the sort
order.
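For example, to make the choice explicit, the sort from the question could be written with a "mode" (a sketch based on the question's own snippet):
{
  "numerals.value": {
    "mode": "min",
    "nested_path": "numerals",
    "nested_filter": {
      "match": {
        "numerals.key": "score"
      }
    }
  }
}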

Difference between score and explained score

I get invalid search results every time with Elasticsearch. I ran a query with explain: true and checked the results. I was surprised that the 'messy' output entries have a different score and explained score:
"_score": 0.32287252,
...
"_explanation": {
"value": 1.6143626,
"description": "product of:",
...
If those (messy) entries had the explained score's value in _score, the output would look perfect. Does anybody know how to fix this?
PS: I tried changing the number of shards from 5 to 1: nothing changed; the output is still invalid.

Require a number of matches against text in ElasticSearch

I'm trying to create a filter against ElasticSearch that requires more than one match before the result is returned. For example, in the following text:
If you're uneasy at the idea of riding in a vehicle that drives itself, just wait till you see Google's new car. It has no gas pedal, no brake and no steering wheel. Google has been demonstrating its driverless technology for several years by retrofitting Toyotas, Lexuses and other cars with cameras and sensors. But now, for the first time, the company has unveiled a prototype of its own: a cute little car that looks like a cross between a VW Beetle and a golf cart.
If I set the minimum number of matches to 2 and searched for Google, I would expect this result because Google appears in the text twice. However, searching on Toyota with the same number of expected matches should not return this article.
How do I construct this filter?
Probably not exactly what you are looking for, but you could add explain to your query and then filter on the client side by the number of term matches. From the docs, the query would look like this:
GET /_search?explain
{
"query" : { "match" : { "tweet" : "honeymoon" }}
}
Results would look like this:
"_explanation": {
"description": "weight(tweet:honeymoon in 0)
[PerFieldSimilarity], result of:",
"value": 0.076713204,
"details": [
{
"description": "fieldWeight in 0, product of:",
"value": 0.076713204,
"details": [
{
"description": "tf(freq=1.0), with freq of:",
"value": 1,
"details": [
{
"description": "termFreq=1.0",
"value": 1
}
]
},
{
"description": "idf(docFreq=1, maxDocs=1)",
"value": 0.30685282
},
{
"description": "fieldNorm(doc=0)",
"value": 0.25,
}
]
}
]
}
You could then filter on the description field for term frequency and look for a value > 1.
I believe you may be able to do this directly (with no client-side filtering) by using scripting, since you can get a reference to the term frequency:
Term statistics:
Term statistics for a field can be accessed with a subscript operator like this: _index['FIELD']['TERM']. This will never return null, even if the term or field does not exist. If you do not need the term frequency, call _index['FIELD'].get('TERM', 0) to avoid unnecessary initialization of the frequencies. The flag will only take effect if you set the index_options to docs (see the mapping documentation).
_index['FIELD']['TERM'].df()
df of term TERM in field FIELD. Will be returned even if the term is not present in the current document.
_index['FIELD']['TERM'].ttf()
The sum of term frequencies of term TERM in field FIELD over all documents. Will be returned even if the term is not present in the current document.
_index['FIELD']['TERM'].tf()
tf of term TERM in field FIELD. Will be 0 if the term is not present in the current document.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
However, I've not done this and there are the normal concerns about both security and performance when using server side scripting.
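As an untested sketch of that idea (for an older Elasticsearch version where the _index scripting API was available; the index and field names are illustrative), a script filter could require at least two occurrences of a term:
GET /articles/_search
{
  "query": {
    "filtered": {
      "query": { "match": { "body": "google" } },
      "filter": {
        "script": {
          "script": "_index['body']['google'].tf() >= 2"
        }
      }
    }
  }
}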
