Elasticsearch: apply boost based on nested field value

Below is my indexed document
{
"defaultBoostValue":1.01,
"boostDetails": [
{
"Type": "Type1",
"value": 1.0001
},
{
"Type": "Type2",
"value": 1.002
},
{
"Type": "Type3",
"value": 1.0005
}
]
}
I want to apply a boost based on the type passed in: if I pass Type1, the boost applied should be 1.0001, and if Type1 does not exist in boostDetails, defaultBoostValue should be used instead.
Below is my query. It works but is quite slow because it reads from _source. Is there any way to optimize it further?
{
"query": {
"function_score": {
"boost_mode": "multiply",
"functions": [
{
"script_score": {
"script": {
"source": """
double findBoost(Map params_copy) {
for (def group : params_copy._source.boostDetails) {
if (group['Type'] == params_copy.preferredBoostType ) {
return group['value'];
}
}
return params_copy._source['defaultBoostValue'];
}
return findBoost(params)
""",
"params": {
"preferredBoostType": "Type1"
}
}
}
}
]
}
}
}
I have dropped the requirement of avoiding dynamic mapping. If changing the structure of the boostDetails mapping helps, I am fine with that, but please explain how it helps and why it is faster to query. If the answer involves modifying the mapping, please also give the mapping types and the modified structure.

Using dynamic mappings (lots of fields)
It looks like you adjusted the doc structure compared to your original question.
The query above was designed for nested fields, which cannot be easily iterated in a script for performance reasons. That said, it is an even slower workaround, because it accesses the docs' _source and iterates over its contents. Keep in mind that accessing _source in scripts is not recommended!
If your docs aren't nested anymore, you can access the so-called doc values, which are much better optimized for query-time access:
{
"query": {
"function_score": {
...
"functions": [
{
...
"script_score": {
"script": {
"lang": "painless",
"source": """
try {
if (doc['boost.boostType.keyword'].value == params.preferredBoostType) {
return doc['boost.boostFactor'].value;
} else {
throw new Exception();
}
} catch(Exception e) {
return doc['fallbackBoostFactor'].value;
}
""",
"params": {
"preferredBoostType": "Type1"
}
}
}
}
]
}
}
}
thus speeding up your function score query.
Alternative using an ordered list of values
Since the nested iteration is slow and dynamic mappings are blowing up your index, you could store your boosts in a standardized ordered list in each document:
"boostValues": [1.0001, 1.002, 1.0005, ..., 1.1]
and keep track of the corresponding boost types' order in the backend where you construct the queries:
var boostTypes = ["Type1", "Type2", "Type3", ..., "TypeN"]
So something like n-hot vectors.
Then, as you construct the Elasticsearch query, you'd look up the array index of the boost value based on the boostType and pass this index to a script query like the one above, which would access the corresponding boostValues doc value.
This should be considerably faster than _source access, but it requires that you always keep boostTypes and boostValues in sync, preferably append-only (as you add new boostTypes, the list grows in one dimension). Be aware that doc values of a multi-valued numeric field are exposed to scripts sorted by value, not in the order stored in _source, so check that positional access returns the value you expect.
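The backend-side lookup described above could look roughly like the sketch below. The names BOOST_TYPES, boost_index_for, build_script_params and the boostIndex parameter are hypothetical, and the Painless script that would consume the index is deliberately left out.

```python
# Hypothetical backend-side registry. Its order must mirror the order in which
# boostValues are written into every document, and it should be append-only.
BOOST_TYPES = ["Type1", "Type2", "Type3"]


def boost_index_for(boost_type):
    """Return the position of boost_type in BOOST_TYPES, or -1 if unknown."""
    try:
        return BOOST_TYPES.index(boost_type)
    except ValueError:
        return -1


def build_script_params(boost_type):
    """Build the params object to hand to the script_score script."""
    # -1 is this sketch's signal for "fall back to the default boost value".
    return {"boostIndex": boost_index_for(boost_type)}


params = build_script_params("Type2")
print(params)
```

Keeping the lookup in one place makes it harder for the type list and the stored values to drift apart.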

Related

How to compare two date fields in same document in elasticsearch

In my Elasticsearch index, each document has two date fields, createdDate and modifiedDate. I'm trying to add a filter in Kibana to fetch the documents where modifiedDate is greater than createdDate. How do I create this filter in Kibana?
I tried the query below, but instead of greater-than it seems to behave like gte and fetches all records:
GET index/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script" : {
"inline" : "doc['modifiedTime'].value.getMillis() > doc['createdTime'].value.getMillis()",
"lang" : "painless"
}
}
}
}
}
}
There are a few options.
Option A: The easiest and most performant one is to store the difference of the two fields inside a new field of your document, e.g.
{
"createdDate": "2022-01-11T12:34:56Z",
"modifiedDate": "2022-01-11T12:34:56Z",
"diffMillis": 0
}
{
"createdDate": "2022-01-11T12:34:56Z",
"modifiedDate": "2022-01-11T12:35:58Z",
"diffMillis": 62000
}
Then, in Kibana you can query on diffMillis > 0 and figure out all documents that have been modified after their creation.
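For Option A, the difference has to be computed before the document is indexed. Here is a minimal sketch of doing that on the application side, assuming the field names createdDate and modifiedDate from the question and ISO-8601 timestamps ending in Z. The helper names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_iso_utc(timestamp):
    """Parse an ISO-8601 timestamp with a trailing 'Z' into an aware datetime."""
    # Older Python versions reject 'Z' in fromisoformat(), so normalise it first.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def with_diff_millis(doc):
    """Return a copy of doc with diffMillis = modifiedDate - createdDate."""
    delta = parse_iso_utc(doc["modifiedDate"]) - parse_iso_utc(doc["createdDate"])
    enriched = dict(doc)
    # Integer division by one millisecond avoids float rounding.
    enriched["diffMillis"] = delta // timedelta(milliseconds=1)
    return enriched


doc = {"createdDate": "2022-01-11T12:34:56Z", "modifiedDate": "2022-01-11T12:35:58Z"}
print(with_diff_millis(doc))
```

An ingest pipeline could do the same job inside Elasticsearch. The sketch only shows the arithmetic.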
Option B: You can use a script query
GET index/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script": """
return doc['createdDate'].value.millis < doc['modifiedDate'].value.millis;
"""
}
}
}
}
}
Note: depending on the amount of data you have, this option can potentially have disastrous performance, because it needs to be evaluated on ALL of your documents.
Option C: If you're using ES 7.11+, you can use runtime fields directly from the Kibana Discover view.
You can use the following script in order to add a new runtime field (e.g. name it diffMillis) to your index pattern:
emit(doc['modifiedDate'].value.millis - doc['createdDate'].value.millis)
And then you can add the following query into your search bar
diffMillis > 0
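The same runtime field can also be defined per search request rather than on the index pattern. The sketch below is a minimal version, assuming ES 7.11+ and the field names from the question. The emit script is the one shown above, and the helper name is hypothetical. It only builds the request body. Sending it is left to whichever client you use.

```python
import json


def build_runtime_diff_query(modified_field="modifiedDate", created_field="createdDate"):
    """Build a search body that defines diffMillis at query time and filters on it."""
    script = (
        f"emit(doc['{modified_field}'].value.millis"
        f" - doc['{created_field}'].value.millis)"
    )
    return {
        "runtime_mappings": {
            "diffMillis": {"type": "long", "script": {"source": script}}
        },
        "query": {"range": {"diffMillis": {"gt": 0}}},
    }


body = build_runtime_diff_query()
print(json.dumps(body, indent=2))
```

Like the script query in Option B, a runtime field is evaluated per document at search time, so the same performance caveat applies.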

ElasticSearch query returns wrong results

I'm relatively new to Elasticsearch and have run into an issue that I can't explain.
For one particular field, Elasticsearch seems to treat all the values as zero, even though the individual records hold non-zero values. This only seems to happen to this numeric field and not to other similar fields (such as cpu pct, mem pct, etc.).
The records only show up when I query for 'system.filesystem.used.pct == 0', whereas none of them show up when I query for something like 'system.filesystem.used.pct > 0'.
I also ran the query in the Dev Tools in Kibana like so, yet I don't get any results:
GET metricbeat-*/_search
{
"query": {
"range":{
"system.filesystem.used.pct":{
"gt":0
}
}
}
}
However, if I run this, I get all the records whose actual values are non-zero, just like in Discover:
GET metricbeat-*/_search
{
"query": {
"term": {
"system.filesystem.used.pct":0
}
}
}
As pointed out by @Ron Serruya, there is a mapping issue. The mapping for system.filesystem.used.pct was detected as an integer type. Since you are getting the expected search results for the cpu.pct field, its mapping must have been detected as float.
CASE 1:
If you index the two sample data as (in the same order)
{
"count": 0.45
}
{
"count": 0
}
Then the float data type is detected by Elasticsearch (if you are using dynamic mapping). This is because the detected field type depends on the first value indexed into the field.
CASE 2:
Now, if you index the data in this order
{
"count": 0
}
{
"count": 0.45
}
Here Elasticsearch will detect count to be of the long data type.
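The two cases can be mimicked with a toy model in plain Python. It is not Elasticsearch code, and the function names are hypothetical. It assumes the default behaviour for integer-typed fields, where a fractional value is truncated when indexed while _source keeps the original number. Under that assumption, a pct value between 0 and 1 is indexed as 0, which would explain why the term query on 0 matches and the range query on gt 0 does not.

```python
def detect_dynamic_type(first_value):
    """Toy model: the first JSON number seen decides the field type."""
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(first_value, bool):
        raise TypeError("this toy model only handles JSON numbers")
    if isinstance(first_value, int):
        return "long"
    if isinstance(first_value, float):
        return "float"
    raise TypeError("this toy model only handles JSON numbers")


def indexed_value(mapped_type, value):
    """Value as it would be searchable in the index (not what _source shows)."""
    # Assumption: integer-typed fields truncate fractional input by default.
    return int(value) if mapped_type == "long" else float(value)


case1_type = detect_dynamic_type(0.45)  # CASE 1: 0.45 is indexed first
case2_type = detect_dynamic_type(0)     # CASE 2: 0 is indexed first

print(case1_type, indexed_value(case1_type, 0.45))
print(case2_type, indexed_value(case2_type, 0.45))
```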
You need to recreate the index with the new index mapping, reindex the data, and then run the search query on system.filesystem.used.pct.
The modified index mapping will be:
{
"mappings": {
"properties": {
"system": {
"properties": {
"filesystem": {
"properties": {
"used": {
"properties": {
"pct": {
"type": "float"
}
}
}
}
}
}
}
}
}
}

Use query result as parameter for another query in Elasticsearch DSL

I'm using the Elasticsearch DSL and trying to use a query result as a parameter for another query, like below:
{
"query": {
"bool": {
"must_not": {
"terms": {
"request_id": {
"query": {
"match": {
"processing.message": "OUT Followup Synthesis"
}
},
"fields": [
"request_id"
],
"_source": false
}
}
}
}
}
}
As you can see above, I'm trying to search for documents whose request_id is not one of the request_ids with processing.message equal to OUT Followup Synthesis.
I'm getting an error with this query:
Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]
How can I achieve my goal using Elasticsearch DSL?
Original question extracted from the comments
I'm trying to fetch data with processing.message equal to 'IN Followup Sythesis' whose request_id doesn't appear in data with processing.message equal to 'OUT Followup Sythesis'. In SQL terms:
SELECT d FROM data d
WHERE d.processing.message = 'IN Followup Sythesis'
AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis');
Answer: generally speaking, neither SQL-style joins nor subqueries are supported in Elasticsearch.
So you'll have to run your first query, take the retrieved IDs, and put them into a second query, ideally a terms query.
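The two-step approach could be sketched as follows. The helper names are hypothetical, and the sketch only builds request bodies and reads a response dict, so sending the requests is left to your client. Using match_phrase instead of the question's match is an assumption made here so that only the full message matches. The sketch also assumes request_id is indexed in a form that a terms query can match exactly.

```python
def build_first_query(disallowed_msg):
    """Step 1: find the documents whose request_id must be excluded."""
    return {
        "_source": ["request_id"],
        "query": {"match_phrase": {"processing.message": disallowed_msg}},
    }


def extract_request_ids(first_response):
    """Collect the distinct request_id values from the first query's hits."""
    hits = first_response["hits"]["hits"]
    return sorted({hit["_source"]["request_id"] for hit in hits})


def build_second_query(wanted_msg, excluded_ids):
    """Step 2: match wanted_msg while excluding the collected request IDs."""
    return {
        "query": {
            "bool": {
                "must": {"match_phrase": {"processing.message": wanted_msg}},
                "must_not": {"terms": {"request_id": excluded_ids}},
            }
        }
    }


# Hand-written stand-in for the response to step 1.
sample_response = {
    "hits": {"hits": [{"_source": {"request_id": "abc"}},
                      {"_source": {"request_id": "abc"}}]}
}
excluded = extract_request_ids(sample_response)
second = build_second_query("IN Followup Sythesis", excluded)
print(second)
```

If the first query can return many documents, you would need to page through all of its hits before building the second query, and the terms list is subject to the index's maximum terms count.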
That said, this limitation can be worked around by "hijacking" a scripted metric aggregation.
Taking these 3 documents as examples:
POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
you could run
POST reqs/_search
{
"size": 0,
"query": {
"match": {
"processing.message": "IN Followup Sythesis"
}
},
"aggs": {
"subquery_mock": {
"scripted_metric": {
"params": {
"disallowed_msg": "OUT Followup Synthesis"
},
"init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
"map_script": """
def req_id = params._source.request_id;
def msg = params._source.processing.message;
if (msg.contains(params.disallowed_msg)) {
state.disallowed_request_ids.add(req_id);
// won't need this particular doc so continue looping
return;
}
if (state.by_request_ids.containsKey(req_id)) {
// there may be multiple docs under the same ID
// so concatenate them
state.by_request_ids[req_id].add(params._source);
} else {
// initialize an appendable arraylist
state.by_request_ids[req_id] = [params._source];
}
""",
"combine_script": """
state.by_request_ids.entrySet()
.removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
return state.by_request_ids
""",
"reduce_script": "return states"
}
}
}
}
which'd return only the correct request:
"aggregations" : {
"subquery_mock" : {
"value" : [
{
"xyz" : [
{
"processing" : { "message" : "IN Followup Sythesis" },
"request_id" : "xyz"
}
]
}
]
}
}
⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".
💡 I'd recommend testing this script on a smaller set of documents before letting it target your whole index, for example by restricting it through a date range query or similar.
FYI Elasticsearch exposes an SQL API as part of X-Pack. The SQL REST API is included in the free Basic license, while the JDBC and ODBC drivers require a paid subscription.

How do I search within a list of strings in Elasticsearch?

My data has a field localities which is an array of strings.
"localities": [
"Mayur Vihar Phase 1",
"Paschim Vihar",
"Rohini",
"",
"Laxmi Nagar",
"Vasant Vihar",
"Dwarka",
"Karol Bagh",
"Inderlok" ]
What query should I write to filter the documents by a specific locality such as "Rohini"?
A simple match query will be enough (if you don't know the mapping of your localities field).
POST <your index>/_search
{
"query": {
"match": {
"localities": "Rohini"
}
}
}
If the localities field is mapped as a string type and indexed as not_analyzed, the best way to query it is a term filter wrapped in a filtered query (filters can't be used directly):
POST <your index>/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"localities": "Rohini"
}
}
}
}
}
If you don't need the score, the second solution is the way to go, as filters don't compute a score, are faster, and are cached.
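The filtered query above is Elasticsearch 1.x syntax. It was removed in 5.0, where a bool query with a filter clause plays the same role and a keyword field replaces a not_analyzed string. Below is a minimal sketch of building the equivalent request body. The helper name is hypothetical, and the sketch assumes localities is mapped as keyword.

```python
import json


def locality_filter(locality, field="localities"):
    """Build a non-scoring exact-match filter on one element of a string array."""
    # A term filter matches a document if any value in the array equals `locality`.
    return {"query": {"bool": {"filter": {"term": {field: locality}}}}}


body = locality_filter("Rohini")
print(json.dumps(body))
```

If the field was mapped dynamically as text with a keyword subfield, passing field="localities.keyword" would target the exact, unanalyzed values.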
Check the documentation on analysis, which is a very important subject in Elasticsearch and heavily influences the way you query.
POST /_search
{
"query": {
"match": {
"localities": "Rohini"
}
}
}
Or you can simply query:
GET /_search?q=localities:Rohini

Elasticsearch script_score with array

I am trying to use script_score to update the score based on a json of ID values. The score should multiply the original score by the factor listed in params.
"script_score": {
"params": {
"ranking": {
"1": "1.3403946161270142",
"3": "1.3438195884227753"
}
},
"script": "_score * ranking[doc['ID'].value]"
}
I am getting the following error:
nested: QueryParsingException[[index name] script_score the script could not be loaded]; nested: CompileException[[Error: unbalanced braces [ ... ]]\n[Near : {... _score * ranking[doc['ID'].value] ....}]\n ^\n[Line: 1, Column: 29]]; }]"
If I manually specify an ID for example _score * ranking['1'], it works fine. Also if I use the ID directly it works, but not if I use the ID value as the index. I should note that the ID is an integer. Can anyone help me solve this? Additionally, how would this work if the ID isn't in the ranking list? Would it treat it as score='_score'?
Your ranking param is not an array but a map. The type used for the ID field should match the type used for the keys in your map, and the boost should be a number, not a string.
Here is a document:
{
"ID" : "1"
}
and here is the updated query:
GET /test/_search
{
"query": {
"function_score": {
"functions": [
{
"script_score": {
"params": {
"ranking": {
"1": 1.3403946161270142,
"3": 1.3438195884227753
}
},
"script": "_score * ranking.get(_doc['ID'].value)"
}
}
]
}
}
}
The current script doesn't handle the case where the entry is not in the params map, which leads to a NullPointerException.
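One way to handle the missing-entry case is to fall back to a neutral factor of 1.0, so the original score is left unchanged. The sketch below shows that logic in plain Python. It is not Elasticsearch script code, and the function name and default_factor parameter are hypothetical.

```python
def boosted_score(score, doc_id, ranking, default_factor=1.0):
    """Multiply score by the factor for doc_id, or by default_factor if absent."""
    return score * ranking.get(doc_id, default_factor)


ranking = {"1": 1.3403946161270142, "3": 1.3438195884227753}

print(boosted_score(2.0, "1", ranking))  # ID present: score is boosted
print(boosted_score(2.0, "2", ranking))  # ID missing: score is unchanged
```

The same guard would have to be written in whatever scripting language your Elasticsearch version uses.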
That said, I think this boosting method will not scale, as you need an entry per document in your params, which is hard to maintain. Storing the rank within each document seems better, although you'd need to update the documents every time you want to change it.
