Elasticsearch boosting score for values in array - elasticsearch

I am trying to score documents based on certain values stored in an array in Elasticsearch. For example, say my documents contain an array of objects like this:
Document 1:
{
id: "test",
marks: [{
"classtype" : "x1",
"value": 90
}]
}
Document 2:
{
id: "test2",
marks: [{
"classtype" : "x1",
"value": 50
},{
"classtype" : "x2",
"value": 60
}]
}
I want each document's score to be computed from "value", with the boosting factor chosen on the basis of "classtype".
The equivalent code would be:
// boosting factor per classtype
var boostingfactor = {
"x1": 1,
"x2": 10
};
// sum value * factor over all marks of one document
function smartscore(marks) {
var score = 0;
marks.forEach(function (mark) {
score += mark.value * boostingfactor[mark.classtype];
});
return score;
}
I have tried Elasticsearch queries on plain integer fields, but I am not sure whether the same can be done for values inside an array. I also tried writing scripts in Elasticsearch's Painless language, but couldn't find the right way to filter values based on classtype.
POST /student/_search
{
"query": {
"function_score": {
"script_score" : {
"script" : {
"params": {
"x1": 1,
"x2": 10
},
"source": "params[doc['marks.classtype']] * marks.value"
}
}
}
}
}
The expected result is a score of 90 (90*1) for sample document 1 and 650 (50*1+60*10) for document 2, but the above query fails with this exception:
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"params[doc['marks.classtype'].value]",
" ^---- HERE"
],
"script": "params[doc['marks.classtype'].value]",
"lang": "painless"
}
Is it possible to get this result by modifying the script?
Elasticsearch version: 7.1.0

I was able to read through the array values by iterating over params._source.marks with the following script:
"script_score" : {
"script" : {
"params": {
"x1": 5,
"x2": 10
},
"source": "double sum = 0.0; for (item in params._source.marks) { sum += item.value * params[item.classtype]; } return sum;"
}
}
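To check the arithmetic outside Elasticsearch, the loop in that script can be mirrored in a few lines of Python. This is only a sketch of the summation, not of how Elasticsearch executes it. The function name smart_score is hypothetical, and the factors used are the ones from the question (x1 = 1, x2 = 10), not the x1 = 5 from the script above.

```python
# Boosting factors as given in the question (assumption: x1 -> 1, x2 -> 10).
BOOSTING_FACTOR = {"x1": 1, "x2": 10}


def smart_score(marks, factors=BOOSTING_FACTOR):
    """Sum value * factor over all marks, like the Painless loop does."""
    total = 0.0
    for item in marks:
        total += item["value"] * factors[item["classtype"]]
    return total


doc1 = {"id": "test", "marks": [{"classtype": "x1", "value": 90}]}
doc2 = {
    "id": "test2",
    "marks": [
        {"classtype": "x1", "value": 50},
        {"classtype": "x2", "value": 60},
    ],
}

print(smart_score(doc1["marks"]))  # 90.0
print(smart_score(doc2["marks"]))  # 650.0
```

With the params from the script above (x1 = 5) the same loop gives 450 and 850 instead.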

Related

Elasticsearch conditional sorting by different fields

Let's say that my business need is to sort results differently, based on some "external" parameter that I'm passing to the query.
Documents are more or less like:
{
"transfer_rate": 2000.00,
"some_collection": [
{ "transfer_rate": 1000.00, "identifier": 1, "campaign": 1 },
{ "transfer_rate": 500.00, "identifier": 2, "campaign": 2 },
{ "transfer_rate": 750.00, "identifier": 3, "campaign": 3 },
//...
]
},
{
"transfer_rate": 500.00,
"some_collection": [
{ "transfer_rate": 1000.00, "identifier": 4, "campaign": 1 },
{ "transfer_rate": 2000.00, "identifier": 5, "campaign": 2 },
{ "transfer_rate": 625.00, "identifier": 6, "campaign": 3 },
{ "transfer_rate": 225.00, "identifier": 7, "campaign": 1 },
//...
]
}
Now I have my "parameter"; let's say it is equal to 750.00.
I would like to order this set of documents differently, depending on how the root's transfer_rate compares to the given parameter, as follows:
If doc['transfer_rate'] >= _param then sort by doc['transfer_rate'], else sort by the MIN of doc['some_collection'].transfer_rate.
I know that some document optimisations could be done, but I didn't invent this model, nor am I allowed to change or re-index it.
The tricky part about the nested objects is that they contain a property (campaign in the given example) that has to match a criterion, so basically:
When doc['transfer_rate'] is less than _param, order by the minimum value of doc['some_collection'].transfer_rate where campaign equals XYZ.
So, for the given example and parameter, documents like the first one should be ordered by doc['transfer_rate'] and documents like the second one should be ordered by the nested value.
Thanks for any advice / links / support.
This is going to be a pain if you cannot reindex the data.
I came up with this query:
GET /71095886/_search
{
"query": {
"nested": {
"path": "some_collection",
"query": {
"match": {
"some_collection.campaign": 1
}
}
}
},
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": """
if (doc['transfer_rate'].value >= params.factor){
return doc['transfer_rate'].value;
} else {
def min = 10000;
for (item in doc['some_collection']){
if (item['transfer_rate'] < min){
min = item['transfer_rate'];
}
}
return min;
}
""",
"params": {
"factor": 2000
}
},
"order": "asc"
}
}
}
But it won't work because of the nested objects and how they are stored in Elasticsearch (actually Lucene, but let's not go down that road... yet).
If you add "nested_path" : "some_collection" to _script, you won't have access to the root-level transfer_rate anymore (because it is stored in a different Lucene document).
Maybe one thing you can look into is runtime fields.
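For reference, the ordering rule from the question can be written down client-side. The Python below is only a sketch of the intended sort key, not an Elasticsearch solution. The function name sort_key and the fallback to the root rate when no nested entry matches the campaign are assumptions; the sample data is taken from the question.

```python
def sort_key(doc, param, campaign):
    """Root transfer_rate if it reaches param, otherwise the minimum nested
    transfer_rate among the entries of the given campaign."""
    if doc["transfer_rate"] >= param:
        return doc["transfer_rate"]
    rates = [
        entry["transfer_rate"]
        for entry in doc["some_collection"]
        if entry["campaign"] == campaign
    ]
    # Assumption: fall back to the root rate when no nested entry matches.
    return min(rates) if rates else doc["transfer_rate"]


first = {
    "transfer_rate": 2000.00,
    "some_collection": [
        {"transfer_rate": 1000.00, "identifier": 1, "campaign": 1},
        {"transfer_rate": 500.00, "identifier": 2, "campaign": 2},
        {"transfer_rate": 750.00, "identifier": 3, "campaign": 3},
    ],
}
second = {
    "transfer_rate": 500.00,
    "some_collection": [
        {"transfer_rate": 1000.00, "identifier": 4, "campaign": 1},
        {"transfer_rate": 2000.00, "identifier": 5, "campaign": 2},
        {"transfer_rate": 625.00, "identifier": 6, "campaign": 3},
        {"transfer_rate": 225.00, "identifier": 7, "campaign": 1},
    ],
}

# Parameter 750.00 and campaign 1, ascending order.
ordered = sorted([first, second], key=lambda d: sort_key(d, 750.00, 1))
```

With these inputs the first document sorts by its root value (2000.00) and the second by the smallest campaign-1 entry (225.00), so the second document comes first in ascending order.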

How to get last entry for each distinct value of a field in Grafana with an Elasticsearch data source

I have an Elasticsearch index with documents like these:
{
"_source": {
"category": 1,
"value": 10,
"utctimestamp": "2020-10-21T15:32:00.000+00:00"
}
}
In Grafana, I'm able to retrieve the value of the most recent event with the following query:
Now, I would like to get the MAX value of the most recent documents for each distinct value of category in the given time range.
This means that if I have the following 3 documents in my index:
{
"_source": {
"category": 1,
"value": 10,
"utctimestamp": "2020-10-21T10:30:00"
}
},
{
"_source": {
"category": 2,
"value": 20,
"utctimestamp": "2020-10-21T10:20:00"
}
},
{
"_source": {
"category": 2,
"value": 30,
"utctimestamp": "2020-10-21T10:10:00"
}
}
I would like the query to return the value MAX(10, 20) which is 20. Because the last document for category 1 has the value 10, and the last document for category 2 has the value 20. (If there were a 3rd category, its last value should also be included in the MAX).
Is it possible?
Thanks to @val for his brilliant query in "Sum over top_hits aggregation", your query would be something like this:
{
"size": 0,
"aggs": {
"category": {
"terms": {
"field": "category",
"size": 10
},
"aggs": {
"latest_quantity": {
"scripted_metric": {
"init_script": "params._agg.quantities = new TreeMap()",
"map_script": "params._agg.quantities.put(doc.utctimestamp.date, [doc.utctimestamp.date.millis, doc.value.value])",
"combine_script": "return params._agg.quantities.lastEntry().getValue()",
"reduce_script": "def maxkey = 0; def qty = 0; for (a in params._aggs) {def currentKey = a[0]; if (currentKey > maxkey) {maxkey = currentKey; qty = a[1]} } return qty;"
}
}
}
},
"max_quantities": {
"max_bucket": {
"buckets_path": "category>latest_quantity.value"
}
}
}
}
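The aggregation above does two things: it keeps the latest value per category, then takes the maximum over those values. A minimal Python sketch of that logic, useful for checking expected results against sample data, could look like this. The function name max_of_latest is hypothetical, and the timestamps are assumed to be ISO 8601 strings as in the question.

```python
from datetime import datetime


def max_of_latest(docs):
    """Keep the most recent value per category, then return the maximum."""
    latest = {}  # category -> (timestamp, value)
    for doc in docs:
        ts = datetime.fromisoformat(doc["utctimestamp"])
        category = doc["category"]
        if category not in latest or ts > latest[category][0]:
            latest[category] = (ts, doc["value"])
    # Raises ValueError if docs is empty.
    return max(value for _, value in latest.values())


sample = [
    {"category": 1, "value": 10, "utctimestamp": "2020-10-21T10:30:00"},
    {"category": 2, "value": 20, "utctimestamp": "2020-10-21T10:20:00"},
    {"category": 2, "value": 30, "utctimestamp": "2020-10-21T10:10:00"},
]

print(max_of_latest(sample))  # 20
```

Something along these lines could also live in a middleware layer if the aggregation cannot be expressed directly in Grafana.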
I ended up creating a middleware service with a REST API between Elasticsearch and Grafana that can make all the custom requests to Elasticsearch (like the request given in @saeednasehi's answer), and I query the middleware from Grafana with the JSON data source plugin.

Get the number of appearances of a particular term in an elasticsearch field

I have an elasticsearch index (posts) with following mappings:
{
"id": "integer",
"title": "text",
"description": "text"
}
I simply want to find the number of occurrences of a particular term inside the description field of a single document (I have the document id and the term to find).
E.g. I have a post like this: {id: 123, title:"some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id / post id for this post and just want to find how many times the word "city" appears in the description of this particular post (the result should be 2 in this case).
I can't seem to find a way to do this search. I don't want the occurrences across ALL the documents, just for a single document and inside one of its fields. Please suggest a query for this. Thanks
Elasticsearch Version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"title": {
"type": "text"
},
"description": {
"type": "text",
"fields": {
"simple_analyzer": {
"type": "text",
"fielddata": true,
"analyzer": "simple"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
"id": 123,
"title": "some title",
"description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_agg": {
"terms": {
"field": "description.simple_analyzer",
"size": 20
}
}
}
}
Yielding:
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "city",
"doc_count" : 1
},
{
"key" : "description",
"doc_count" : 1
},
...
]
}
}
Now, as you can see, the simple analyzer split the string into words and lowercased them, but the duplicate city in your string is not reflected: doc_count counts the documents that contain a term, not the occurrences within a document. I could not come up with an analyzer that would keep the duplicates. That being said,
It's advisable to do these word counts before you index!
You would split your string by whitespace and index the words as an array instead of one long string.
This is also possible at search time, although it is very expensive, does not scale well, and requires script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_script": {
"scripted_metric": {
"params": {
"word_of_interest": ""
},
"init_script": "state.map = [:];",
"map_script": """
if (!doc.containsKey('description')) return;
def split_by_whitespace = / /.split(doc['description.keyword'].value);
for (def word : split_by_whitespace) {
// when a word of interest is given, skip every other word
if (params['word_of_interest'] != "" && params['word_of_interest'] != word) {
continue;
}
// continue (not return) so the remaining words are still counted
if (state.map.containsKey(word)) {
state.map[word] += 1;
continue;
}
state.map[word] = 1;
}
""",
"combine_script": "return state.map;",
"reduce_script": "return states;"
}
}
}
}
yielding
...
"aggregations" : {
"terms_script" : {
"value" : [
{
"occurrences" : 1,
"post" : 1,
"city" : 2, <------
"LA," : 1,
"of" : 1,
"this" : 1,
"description" : 1,
"is" : 1,
"has" : 1,
"my" : 1,
"two" : 1,
"word" : 1
}
]
}
}
...
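If the counting is done before indexing, as advised above, it could look roughly like the following Python sketch. The function names word_counts and count_word are hypothetical. Like the script above, it splits on whitespace only, so punctuation stays attached to words ("LA," is counted as its own token).

```python
from collections import Counter


def word_counts(description):
    """Count whitespace-separated tokens in a description string."""
    return Counter(description.split())


def count_word(description, word):
    """Occurrences of one exact token (case-sensitive) in the description."""
    return word_counts(description)[word]


description = "my city is LA, this post description has two occurrences of word city "

print(count_word(description, "city"))  # 2
```

The resulting per-word counts could then be stored alongside the document, so that the lookup for a single post and a single term becomes a plain field read.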

Query_string in combination with function_score always gives score 1.0

I am trying to make a query_string request to Elasticsearch that uses a function_score (script_score) to manipulate the default score, but I always seem to get a base _score of 1.0.
My model looks like this:
{
"name": "Secret Birthday Party",
"description": "SECRET! Discuss with discretion",
"_userCounters": [
{
"user": "king",
"count": 12
},
{
"user": "joseph",
"count": 1
}
]
}
My request with the function_score script looks like this:
{
"query" : {
"function_score" : {
"query": {
"query_string": {
"query": "secret",
"analyze_wildcard": true,
"fields": [
"name", "description"
]
}
},
"script_score": {
"script": {
"inline" : "int scoreBoost = 1; for (int i = 0; i < params['_source']['_userCounters'].length; i++) { if (params['_source']['_userCounters'][i].user == 'joseph') { scoreBoost += params['_source']['_userCounters'][i].count; } } return scoreBoost;"
}
}
}
}
}
What I am getting is a result which finds exactly what I want, but only returns the value from the function_score script. The built-in scoring does not seem to work anymore. This is the response I am getting:
{
"_index": "test3",
"_type": "projects",
"_id": "7",
"_score": 2, // this is exactly the return value of the script_score. What I want instead is that this value gets multiplied with the normal score of ES
"_source": {
"name": "Secret Birthday Party",
"description": "SECRET! Discuss with discretion",
"_userCounters": [
{
"user": "queen",
"count": 12
},
{
"user": "daniel",
"count": 1
}
]
}
}
My guess is that my request body is not in the correct format since all scores are just 1.0 when I take the function_score out completely.
I figured it out. It was actually a problem with the script itself and not with the structure of the request body.
The function was only returning the factor which is supposed to be multiplied with the _score value. Instead it needs to do the multiplication itself.
Here is the script, formatted to be a bit more readable:
int scoreBoost = 1;
for (int i = 0; i < params['_source']['_userData'].length; i++) {
if (params['_source']['_userData'][i].user == '{userId}') {
scoreBoost += params['_source']['_userData'][i].count;
}
}
// the error was here: only the scoreBoost value was returned
// the fix is to multiply it with the _score value
return _score * scoreBoost;
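The arithmetic of the corrected script can be reproduced in plain Python to see what final score to expect. This is a sketch only: boosted_score is a hypothetical name, base_score stands in for the _score Elasticsearch would supply, and the sample counters come from the model in the question.

```python
def boosted_score(base_score, user_counters, user_id):
    """Multiply the base score by 1 plus the counts stored for user_id."""
    score_boost = 1
    for entry in user_counters:
        if entry["user"] == user_id:
            score_boost += entry["count"]
    return base_score * score_boost


counters = [
    {"user": "king", "count": 12},
    {"user": "joseph", "count": 1},
]

print(boosted_score(0.5, counters, "joseph"))  # 1.0
print(boosted_score(0.5, counters, "king"))    # 6.5
```

A user without any counter entry leaves the base score unchanged, because the boost starts at 1.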

ElasticSearch - Boosting based on depth in a recursive structure

I am using Elasticsearch 2.4.4 (compatible with Spring Boot 1.5.2).
I have a document object which has the following structure:
{
"id": 1,
"title": "Doc title",
// some more metadata
"sections": [
{
"id": 2,
"title": "Sec title 1",
"sections": [...]
},
{
"id": 3,
"title": "Sec title 2",
"sections": [...]
}
]
}
Basically, I want to make the titles in the document searchable (the document title, section titles and subsection titles at any level), and I want to be able to score the documents based on the level at which they match in the tree hierarchy.
My initial thought was to use a structure like this:
{
"titles": [
{
"title": "doc title",
"depth": 0
},
{
"title": "sec title 1",
"depth": 1
},
{
"title": "sec title 2",
"depth": 1
},
......
]
}
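Producing that flat list from the recursive document is a short tree walk. The Python below is a sketch under the assumption that every node has a "title" and an optional "sections" list, as in the structure shown earlier. The function name flatten_titles and the sample subsection are hypothetical.

```python
def flatten_titles(node, depth=0):
    """Return a flat list of {"title", "depth"} entries for a nested document."""
    titles = [{"title": node["title"], "depth": depth}]
    for section in node.get("sections", []):
        titles.extend(flatten_titles(section, depth + 1))
    return titles


document = {
    "id": 1,
    "title": "Doc title",
    "sections": [
        {
            "id": 2,
            "title": "Sec title 1",
            # hypothetical subsection, added to show a depth of 2
            "sections": [{"id": 4, "title": "Subsec title", "sections": []}],
        },
        {"id": 3, "title": "Sec title 2", "sections": []},
    ],
}

flat = {"titles": flatten_titles(document)}
```

The entries come out in depth-first order, so each subsection directly follows its parent section.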
I would like to rank the documents based on the depth at which there is a match (the higher the depth, the lower the score).
I know basic boosting based on a field, but
is there a way to do this in Elasticsearch?
OR
Is it possible to do it by changing the structure?
Yes, you can achieve this by indexing documents in your modified format (a flat array of objects) using a Nested datatype mapping and a Function Score Query inside of a Nested Query:
PUT someindex
{
"mappings": {"sometype":{"properties": {"titles":{"type": "nested"}}}}
}
POST someindex/sometype/0
{
"titles": [
{ "title": "doc title", "depth": 0 },
{ "title": "sec title 1", "depth": 1 },
{ "title": "sec title 2", "depth": 1 }
]
}
POST someindex/sometype/1
{
"titles": [
{ "title": "sec doc title", "depth": 0 }
]
}
GET someindex/sometype/_search
{
"query": {
"nested": {
"path": "titles",
"score_mode": "max",
"query": {
"function_score": {
"query": {
"match": {
"titles.title": "sec"
}
},
"functions": [
{
"exp": {
"titles.depth": {
"origin": 0,
"scale": 1
}
}
}
]
}
}
}
}
}
In this example, the document with id 1 is scored higher because it has a title matching sec at depth 0, whereas the document with id 0 only has titles matching sec at depth 1.
The nested datatype and query ensure that the function_score associates the matching title with its depth, and the function score exp prioritizes titles with lower depth.
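To get a feel for how strongly exp penalizes depth, the decay multiplier can be computed by hand. The Python sketch below follows the exponential decay formula as I understand it from the Elasticsearch function_score documentation, assuming the default decay of 0.5 and an offset of 0, since the query above sets neither. The function name exp_decay is hypothetical.

```python
import math


def exp_decay(value, origin=0.0, scale=1.0, offset=0.0, decay=0.5):
    """Multiplier of an exponential decay function:
    exp(lambda * max(0, |value - origin| - offset)), lambda = ln(decay) / scale.
    """
    lam = math.log(decay) / scale
    return math.exp(lam * max(0.0, abs(value - origin) - offset))


for depth in range(4):
    print(depth, exp_decay(depth))
```

Under these assumptions, with origin 0 and scale 1 the multiplier halves with every additional level of depth: about 1.0 at depth 0, 0.5 at depth 1 and 0.25 at depth 2.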
