How to take (length of the aliases field) out of score calculation - elasticsearch

Suppose we have documents of people with their name and an array of aliases, like this:
{
  "name": "Christian",
  "aliases": ["נוצרי", "کریستیان"]
}
Suppose I have a document with 10 aliases and another with 2 aliases,
but both of them contain an alias with the value کریستیان.
The field length (dl) of the aliases field is bigger in the first document than in the second,
so the normalized term frequency (tf) of the first document comes out lower, and eventually the document with fewer aliases scores higher than the other.
Sometimes I want to add more aliases for a person, in different languages and different forms, because he/she is more famous, but this causes the document to get a lower score in results. I want to somehow take the length of the aliases field out of my query's score calculation.

Norms store the relative length of the field:
How long is the field? The shorter the field, the higher the weight.
If a term appears in a short field, such as a title field, it is more
likely that the content of that field is about the term than if the
same term appears in a much bigger body field.
Norms can be disabled (but not re-enabled afterwards) using the PUT mapping API:
PUT my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "norms": false
    }
  }
}
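Applied to the question, a minimal sketch (assuming the index is called people and the field is mapped as text) would disable norms on the aliases field, so its length no longer affects scoring:
PUT people/_mapping
{
  "properties": {
    "aliases": {
      "type": "text",
      "norms": false
    }
  }
}
Note that norms are not removed instantly; they disappear as old segments are merged away, so reindexing makes the effect consistent across all documents.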
Links for further study
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm

Related

Kibana scripted field which loops through an array

I am trying to use the Metricbeat http module to monitor F5 pools.
I make a request to the F5 API and bring back JSON, which is saved to Kibana. But the JSON contains an array of pool members, and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array. E.g.
doc['http.f5pools.items.monitor'].value.length()
returns the following in the preview results, with the same 'Additional Field' added for comparison:
[
  {
    "_id": "rT7wdGsBXQSGm_pQoH6Y",
    "http": {
      "f5pools": {
        "items": [
          {
            "monitor": "default"
          },
          {
            "monitor": "default"
          }
        ]
      }
    },
    "pool.MemberCount": [
      7
    ]
  },
If I try
doc['http.f5pools.items']
or similar, I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? I.e., is my code or the way I'm indexing the data wrong?
If not, is there an alternative approach within Metricbeat? I don't want to have to make a whole new API to do the calculation and add a separate field.
-- update
Weirdly, it seems that the number values in the array do return the expected results, i.e.
doc['http.f5pools.items.ratio']
returns
{
  "_id": "BT6WdWsBXQSGm_pQBbCa",
  "pool.MemberCount": [
    1,
    1
  ]
},
-- update 2
OK, so if the strings in the field have different values then you get all the values; if they are the same, you just get one. wtf?
I'm adding another answer instead of deleting my previous one, which does not address the actual question but may still be helpful for someone else in the future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Upon googling this further I found this Doc Value Intro, which says that doc values are essentially an "uninverted index" useful for operations like sorting; my hypothesis is that while sorting you essentially don't want the same values repeated, and hence the data structure they use removes those duplicates. That still did not answer why it works differently for strings than for numbers: numbers are preserved, but strings are filtered down to unique values.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
Some more deep-diving into doc values revealed that it is a compression technique which actually de-duplicates the values for efficient and memory-friendly operations.
Here's a NOTE given on the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you don't want this behavior, you can disable doc values:
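For example, a minimal sketch of disabling doc values at index creation (the index and field names are illustrative); note that such a field can then no longer be sorted or aggregated on, or accessed from doc[] in scripts:
PUT my_index
{
  "mappings": {
    "properties": {
      "session_id": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}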
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So, as I discovered, arrays are pre-filtered to only return distinct values (except in the case of ints, apparently?).
The solution is to use params._source instead of doc[].
The answer for why doc doesn't work, quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo-points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects
Also, it is important to add a null check, as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field') to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
Also, here is why _source works, quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
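Putting this together, a minimal sketch of the count via script_fields in a search request (assuming the field layout from the question and an index pattern like metricbeat-*; "up" is a hypothetical status value, since the sample docs only show "default"):
GET metricbeat-*/_search
{
  "script_fields": {
    "up_member_count": {
      "script": {
        "lang": "painless",
        "source": "int count = 0; for (def item : params._source.http.f5pools.items) { if (item.monitor == 'up') { count++; } } return count;"
      }
    }
  }
}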
Responding to your comment with an example:
The keyword here is: it cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object.
Try running the below and see the mapping it creates:
PUT t5/doc/2
{
  "items": [
    {
      "monitor": "default"
    },
    {
      "monitor": "default"
    }
  ]
}
GET t5/_mapping
{
  "t5" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "items" : {
            "properties" : {
              "monitor" : {   <-- monitor is a property of the items property (an object)
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
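By contrast, the leaf field can still be read through doc values, which also demonstrates the string de-duplication discussed above. For the t5 document indexed earlier, a quick sketch like
GET t5/_search
{
  "script_fields": {
    "monitors": {
      "script": "doc['items.monitor.keyword']"
    }
  }
}
should return "default" only once, even though the array contains it twice.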

What is the difference between a field and a property in Elasticsearch?

I'm currently trying to understand the difference between fields (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) and properties (https://www.elastic.co/guide/en/elasticsearch/reference/current/properties.html).
They are both somehow defined as a "subfield/subproperty" of a type/mapping property, both can have separate types and analyzers (as far as I understood it), and both are accessed by dot notation (mappingProperty.subField or mappingProperty.property).
The docs use the terms "field" and "property" somewhat interchangeably, I have the feeling; for example:
Type mappings, object fields and nested fields contain sub-fields,
called properties.
What is the difference between properties and (sub-)fields? How do I decide if I have a property or a field?
In other words, how do I decide if I use
{
  "mappings": {
    "_doc": {
      "properties": {
        "myProperty": {
          "properties": {
          }
        }
      }
    }
  }
}
or
{
  "mappings": {
    "_doc": {
      "properties": {
        "myProperty": {
          "fields": {
          }
        }
      }
    }
  }
}
Sub-fields are indexed from the parent property's source, while sub-properties need to have a "real" value in the document's source.
If your source contains a real object, you need to create properties; each property will correspond to a different value from your source.
If you only want to index the same value with different analyzers, then use sub-fields.
It is often useful to index the same field in different ways for
different purposes. This is the purpose of multi-fields. For instance,
a string field could be mapped as a text field for full-text search,
and as a keyword field for sorting or aggregations:
(sorry, I find it hard to explain =| )
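As a concrete sketch of the two cases (index and field names are illustrative): the city property is a real value in the source, while city.raw is a sub-field that re-indexes the same value as a keyword:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}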
Note: This is an explanation from my current understanding. It may not be 100% accurate.
A property is what we used to call a field in an RDBMS (a standard relational DB like MySQL). It stores the properties of an object and provides the high-level structure for an index (which we can compare to a table in a relational DB).
A field, which is linked to (or included in) the property concept, is a way to index that property using a specific analyzer.
So let's say you have:
One analyzer (A) to uppercase
One analyzer (B) to lowercase
One analyzer (C) to translate to Spanish (this doesn't even exist, just to give you an idea)
What an analyzer does is transform the input (the text of a property) into a series of tokens that will be indexed. When you search, the same analyzer is used, so the text is transformed into those tokens; each one is given a score, and those tokens are then used to grab documents from the index.
(A) Dog = DOG
(B) Dog = dog
(C) Dog = perro
To search using a specific field configuration, you call it using a dot:
The text field uses the standard analyzer.
The text.english field uses the English analyzer.
So the fields basically allow you to perform searches using different token generation models.
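For instance, a minimal sketch of querying a specific sub-field via dot notation (assuming a text field with an english sub-field mapped with the English analyzer, as above):
GET my_index/_search
{
  "query": {
    "match": {
      "text.english": "quick brown foxes"
    }
  }
}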

Elasticsearch: auto increment integer field across two index

I need an auto-increment integer field across two indices.
Can Elasticsearch do this automatically, like MySQL's "auto increment" field in a table?
E.g., when putting some documents into two different indices:
POST /my_index_1/blogpost/
{
  "title": "Foo Bar"
}
POST /my_index_2/blogpost/
{
  "title": "Baz quux"
}
On retrieval, I want:
GET /my_index_*/blogpost/
{
  "uid" : 1,
  "title": "Foo Bar"
},
{
  "uid" : 2,
  "title": "Baz quux"
}
No, ES does not have any auto-increment feature. Since it is a distributed system, figuring out the correct value for the counter is non-trivial, especially since (bulk) indexing tends to be heavily concurrent; you can typically max out the CPUs on all nodes if you throw enough documents at it.
So your best option is to do this outside of ES, before you send the documents to ES. Or even better, don't do this. If you need some kind of insertion order, a better option is to simply use a timestamp; timestamps are actually stored as numbers internally. You still might get duplicates, of course, if two documents get indexed in the same millisecond. A trick we've used to work around that is to offset documents indexed at the same time by 1 ms, to ensure we keep the insertion order.
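As a sketch of the timestamp approach using an ingest pipeline (the pipeline and field names are made up; this gives you insertion order, not a gapless counter, and same-millisecond ties still need the offset trick described above):
PUT _ingest/pipeline/insertion-order
{
  "processors": [
    { "set": { "field": "insert_ts", "value": "{{_ingest.timestamp}}" } }
  ]
}

POST /my_index_1/blogpost?pipeline=insertion-order
{
  "title": "Foo Bar"
}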

analyzed vs not_analyzed: storage size

I recently started using Elasticsearch 2, and as I understand analyzed vs not_analyzed in the mapping, not_analyzed should be better in terms of storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story).
For testing purposes I created some indexes with all the string fields analyzed (the default), and then I created some other indexes with all the fields not_analyzed. My surprise came when I checked the size of the indexes and saw that the indexes with the not_analyzed strings were 40% bigger! I was inserting the same documents into each index (35000 docs).
Any idea why this is happening? My documents are simple JSON documents. I have 60 string fields in each document that I want to set as not_analyzed, and I tried both setting each field as not_analyzed and creating a dynamic template.
I'm editing to add the mapping, although I think there is nothing special about it:
{
  "mappings": {
    "my_type" : {
      "_ttl" : { "enabled" : true, "default" : "7d" },
      "properties" : {
        "field1" : {
          "properties" : {
            "field2" : {
              "type" : "string", "index" : "not_analyzed"
            }
            more not_analyzed String fields here
            ...
            ...
            ...
}
not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis" - in Lucene parlance).
As an example:
(Doc 1) "The quick brown fox jumped over the lazy dog"
(Doc 2) "Lazy like the fox"
Simplified postings list created by the Standard Analyzer (the default for analyzed string fields: tokenized, lowercased, stopwords removed):
"brown": [1]
"dog": [1]
"fox": [1,2]
"jumped": [1]
"lazy": [1,2]
"over": [1]
"quick": [1]
30 characters worth of string data
Simplified postings list created by "index": "not_analyzed":
"The quick brown fox jumped over the lazy dog": [1]
"Lazy like the fox": [2]
62 characters worth of string data
Analysis causes the input to be tokenized and normalized so that documents can be looked up by term.
But as a result, the unit of text is reduced to a normalized term (vs. the entire field with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list, saving you all the space that would normally be consumed by repeated terms and stopwords.
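You can inspect the analyzed form directly with the _analyze API. (One caveat: out of the box, Elasticsearch's standard analyzer keeps stopwords unless a stopwords list is configured, so the simplified postings list above assumes one.)
GET _analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox jumped over the lazy dog"
}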
From the documentation, it looks like not_analyzed makes the field act like a "keyword" instead of a "full-text" field; let's compare these two!
Full text
These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed.
Keyword
Keyword fields are not_analyzed. Instead, the exact string value is added to the index as a single term.
I'm not surprised that storing an entire string as a single term, rather than breaking it into a list of terms, doesn't necessarily translate to saved space. Honestly, it probably depends on the index's analyzer and the strings being indexed.
As a side note, I just re-indexed about a million documents of production data and cut our index disk usage by ~95%. The main difference I made was modifying what was actually saved in the source (i.e., stored). We indexed PDFs for searching, but did not need them to be returned, and so that saved us from storing this information in two different ways (analyzed and raw). There are some very real downsides to this, though, so be careful!
Doc1 {
  "name": "my name is mayank kumar"
}
Doc2 {
  "name": "mayank"
}
Doc3 {
  "name": "Mayank"
}
We have 3 documents.
If the field 'name' is not_analyzed and we search for 'mayank', only the second document is returned; if we search for 'Mayank', only the third document is returned.
If the field 'name' is analyzed by an analyzer, such as a lowercase analyzer (just as an example), and we search for 'mayank', all 3 documents are returned.
If we search for 'kumar', the first document is returned. This happens because in the first document the field value gets tokenized as "my" "name" "is" "mayank" "kumar".
'not_analyzed' is basically used for exact matching (mostly, except in wildcard matching): less space on disk, and less time during indexing.
'analyzed' is basically used for full-text search: more space on disk (if the analyzed fields are big), and more time during indexing (more terms due to analysis).
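As a sketch of the two query styles against these documents (assuming an index called my_index):
# exact match against the not_analyzed field: only the literal value matches
GET my_index/_search
{
  "query": { "term": { "name": "mayank" } }
}

# match query against the analyzed field: the query text is analyzed too, so 'kumar' finds Doc1
GET my_index/_search
{
  "query": { "match": { "name": "kumar" } }
}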

tf/idf boosting within field

My use case is like this:
For the query "iphone charger", I am getting higher relevance for results whose name is "iphone charger coupons" than for those whose name is "iphone charger", possibly because of better matches in the description and other fields. Boosting the name field isn't helping much unless I skew the importance drastically. What I really need is the tf/idf boost within the name field.
To quote the Elasticsearch blog:
the frequency of a term in a field is offset by the length of the field. However, the practical scoring function treats all fields in the same way. It will treat all title fields (because they are short) as more important than all body fields (because they are long).
I need to boost this "more important" value for a particular field. Can we do this with function_score, or in any other way?
A one-term difference in length is not much of a difference to the scoring algorithm (and, in fact, can vanish entirely due to imprecision in the length norm). If there are hits on other fields, you have a lot of scoring elements to fight against.
A dis_max query would probably be a reasonable approach to this. Instead of all the additive scores and coords you are trying to overcome, it will simply select the score of the best-matching subquery. If you boost the subquery against name, you can ensure matches there are strongly preferred.
You can then assign a tie_breaker, so that the score of the description subquery is factored in only when the name scores are tied.
{
  "dis_max" : {
    "tie_breaker" : 0.2,
    "queries" : [
      {
        "terms" : {
          "name" : ["iphone", "charger"],
          "boost" : 10
        }
      },
      {
        "terms" : {
          "description" : ["iphone", "charger"]
        }
      }
    ]
  }
}
Another approach to this sort of thing, if you absolutely know when you have an exact match against the entire field, is to separately index an untokenized version of that field and query that field as well. Any match against the untokenized version will be an exact match against the entire field contents. This would prevent you from needing to rely on the length norm to make that determination.
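A minimal sketch of that approach (assuming name.raw is the untokenized version, e.g. a keyword/not_analyzed sub-field of name):
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "iphone charger" } },
        {
          "term": {
            "name.raw": {
              "value": "iphone charger",
              "boost": 10
            }
          }
        }
      ]
    }
  }
}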
