Aggregations on nested documents with painless scripting - elasticsearch

USING ELASTIC SEARCH 6.2
So I have a deeply nested document structure which has all the proper mapping (nested, text, keyword, etc). A sample document is as follows:
{
"type": "Certain Type",
"lineItems": [
{
"lineValue": 10,
"events": [
{
"name": "CREATED",
"timeStamp": "TIME VALUE"
},
{
"name": "ENDED",
"timeStamp": "TIME VALUE"
}
]
}
]
}
What I want to do is find out the average time required for all lines to go from CREATED to ENDED.
I created the following query
GET /_search
{
"size": 0,
"query": {
"match": {
"type": "Certain Type"
}
},
"aggs": {
"avg time": {
"nested": {
"path": "lineItems.events"
},
"aggs": {
"avg time": {
"avg": {
"script": {
"lang": "painless",
"source": """
long timeDiff = 0;
long fromTime = 0;
long toTime = 0;
if(doc['lineItems.events.name.keyword'] == "CREATED"){
fromTime = doc['lineItems.events.timeValue'].value.getMillis();
}
else if(doc['lineItems.events.name.keyword'] == "ENDED"){
toTime = doc['lineItems.events.timeValue'].value.getMillis();
}
timeDiff = toTime-fromTime;
return (timeDiff)
"""
}
}
}
}
}
}
}
The Result was that I got 0 as the aggregation result which is wrong.
Is there any way to achieve this?

Use doc[ in nested object script does not work as nested are a new document for elastic search.
Use params._source instead (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html). Note access to source would be really slow, if you have a lot of documents or if you need to request this query a lot, consider add this field on main document.
I consider all value exist, add if robustness test if needed, this should work.
long toTime = 0;
long fromTime = 0;
timeDiff = params['_source']['ENDED']
fromTime = params['_source']['CREATED']
return (toTime - fromTime);

Related

ElasticSearch Filter by sum of nested documents

I am trying to filter products where a sum of properties in the nested filtered objects is in some range.
I have the following mapping:
{
"product": {
"properties": {
"warehouses": {
"type": "nested",
"properties": {
"stock_level": {
"type": "integer"
}
}
}
}
}
}
Example data:
{
"id": 1,
"warehouses": [
{
"id": 2001,
"stock_level": 5
},
{
"id": 2002,
"stock_level": 0
},
{
"id": 2003,
"stock_level": 2
}
]
}
In ElasticSearch 5.6 I used to do this:
GET products/_search
{
"query": {
"bool": {
"filter": [
[
{
"script": {
"script": {
"source": """
int total = 0;
for (def warehouse: params['_source']['warehouses']) {
if (params.warehouse_ids == null || params.warehouse_ids.contains(warehouse.id)) {
total += warehouse.stock_level;
}
}
boolean gte = true;
boolean lte = true;
if (params.gte != null) {
gte = (total >= params.gte);
}
if (params.lte != null) {
lte = (total <= params.lte);
}
return (gte && lte);
""",
"lang": "painless",
"params": {
"gte": 4
}
}
}
}
]
]
}
}
}
The problem is that params['_source']['warehouses'] no longer works in ES 6.8, and I am unable to find a way to access nested documents in the script.
I have tried:
doc['warehouses'] - returns error (“No field found for [warehouses] in mapping with types []" )
ctx._source.warehouses - “Variable [ctx] is not defined.”
I have also tried to use scripted_field but it seems that scripted fields are getting calculated on the very last stage and are not available during query.
I also have a sorting by the same logic (sort products by the sum of stocks in the given warehouses), and it works like a charm:
"sort": {
"warehouses.stock_level": {
"order": "desc",
"mode": "sum",
"nested": {
"path": "warehouses"
"filter": {
"terms": {
"warehouses.id": [2001, 2003]
}
}
}
}
}
But I can't find a way to access this sort value either :(
Any ideas how can I achieve this? Thanks.
I recently had the same issue. It turns out the change occurred somewhere around 6.4 during refactoring and while accessing _source is strongly discouraged, it looks like people are still using / wanting to use it.
Here's a workaround taking advantage of the include_in_root parameter.
Adjust your mapping
PUT product
{
"mappings": {
"properties": {
"warehouses": {
"type": "nested",
"include_in_root": true, <--
"properties": {
"stock_level": {
"type": "integer"
}
}
}
}
}
}
Drop & reindex
Reconstruct the individual warehouse items in a for loop while accessing the flattened values:
GET product/_search
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": {
"source": """
int total = 0;
def ids = doc['warehouses.id'];
def levels = doc['warehouses.stock_level'];
for (def i = 0; i < ids.length; i++) {
def warehouse = ['id':ids[i], 'stock_level':levels[i]];
if (params.warehouse_ids == null || params.warehouse_ids.contains(warehouse.id)) {
total += warehouse.stock_level;
}
}
boolean gte = true;
boolean lte = true;
if (params.gte != null) {
gte = (total >= params.gte);
}
if (params.lte != null) {
lte = (total <= params.lte);
}
return (gte && lte);
""",
"lang": "painless",
"params": {
"gte": 4
}
}
}
}
]
}
}
}
Be aware that this approach assumes that all warehouses include a non-null id and stock level.

Compute percentile with collapsing by user

Let says I have an index where I save a million of tweets (original object). I want to get the 90th percentile users based on the number of followers.
I know there is the aggregation "percentile" to do this, but my problem is that ElasticSearch use all documents so I have some users that tweet a lot who noise my calculation.
I want to isolate all unique user then compute the 90th.
The other constraint is that I want to do this in only one or two requests to keep the response lower than 500ms.
I have tried a lot of things and I was able to do this with "scripted_metric" but when my dataset exceed 100k of tweets the performances go down criticaly.
Any advice ?
Additionnal infos :
My index store orginal tweets & retweets based on user search queries
The index is mapped with a dynamic template mapping (No problem with this)
The index contains approximatly 100M
Unfortunately, "top hits" aggregation doesn't accept sub-aggs.
The request I try to achieve is :
{
"collapse": {
"field": "user.id" <--- I want this effect on aggregation
},
"query": {
"bool": {
"must": [
{
"term": {
"metadatas.clientId": {
"value": projectId
}
}
},
{
"match": {
"metadatas.blacklisted": false
}
}
],
"filter": [
{
"range": {
"publishedAt": {
"gte": "now-90d/d"
}
}
}
]
}
},
"aggs":{
"twitter": {
"percentiles": {
"field": "user.followers_count",
"percents": [95]
}
}
},
"size": 0
}
Finally, I figure out to find a workaround.
In percentile aggregation, I can use a script. I use params variable to hold unique keys then return preceding _score.
Without the complete explanation of the computation, I cannot fine tune the behavior of my script. But the result is good enough for me.
"aggs": {
"unique":{
"cardinality": {
"field": "collapse_profile"
}
},
"thresholds":{
"percentiles": {
"field": "user.followers_count",
"percents": [90],
"script": {
"source": """
if(params.keys == null){
params.keys = new HashMap();
}
def key = doc['user.id'].value;
def value = doc['user.followers_count'].value;
if(params.keys[key] == null){
params.keys[key] = _score;
return value;
}
return _score;
""",
"lang": "painless"
}
}
}
}

Elasticsearch partial update based on Aggregation result

I want to update partially all objects that are based on aggregation result.
Here is my object:
{
"name": "name",
"identificationHash": "aslkdakldjka",
"isDupe": false,
...
}
My goal is to set isDupe to "true" for all documents where "identificationHash" is there more than 2 times.
Currently what I'm doing is:
I get all the documents that "isDupe" = false with a Term aggregation on "identificationHash" for a min_doc_count of 2.
{
"query": {
"bool": {
"must": [
{
"term": {
"isDupe": {
"value": false,
"boost": 1
}
}
}
]
}
},
"aggregations": {
"identificationHashCount": {
"terms": {
"field": "identificationHash",
"size": 10000,
"min_doc_count": 2
}
}
}
}
With the aggregation result, I do a bulk update with a script where "ctx._source.isDupe=true" for all identificationHash that match my aggregation result.
I repeat step 1 and 2 until there is no more result from the aggregation query.
My question is: Is there a better solution to that problem? Can I do the same thing with one script query without looping with batch of 1000 identification hash?
There's no solution that I know of that allows you to do this in on shot. However, there's a way to do it in two steps, without having to iterate over several batches of hashes.
The idea is to first identify all the hashes to be updated using a feature called Transforms, which is nothing else than a feature that leverages aggregations and builds a new index out of the aggregation results.
Once that new index has been created by your transform, you can use it as a terms lookup mechanism to run your update by query and update the isDupe boolean for all documents having a matching hash.
So, first, we want to create a transform that will create a new index featuring documents containing all duplicate hashes that need to be updated. This is achieved using a scripted_metric aggregation whose job is to identify all hashes occurring at least twice and for which isDupe: false. We're also aggregating by week, so for each week, there's going to be a document containing all duplicates hashes for that week.
PUT _transform/dup-transform
{
"source": {
"index": "test-index",
"query": {
"term": {
"isDupe": "false"
}
}
},
"dest": {
"index": "test-dups",
"pipeline": "set-id"
},
"pivot": {
"group_by": {
"week": {
"date_histogram": {
"field": "lastModifiedDate",
"calendar_interval": "week"
}
}
},
"aggregations": {
"dups": {
"scripted_metric": {
"init_script": """
state.week = -1;
state.hashes = [:];
""",
"map_script": """
// gather all hashes from each shard and count them
def hash = doc['identificationHash.keyword'].value;
// set week
state.week = doc['lastModifiedDate'].value.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR).toString();
// initialize hashes
if (!state.hashes.containsKey(hash)) {
state.hashes[hash] = 0;
}
// increment hash
state.hashes[hash] += 1;
""",
"combine_script": "return state",
"reduce_script": """
def hashes = [:];
def week = -1;
// group the hash counts from each shard and add them up
for (state in states) {
if (state == null) return null;
week = state.week;
for (hash in state.hashes.keySet()) {
if (!hashes.containsKey(hash)) {
hashes[hash] = 0;
}
hashes[hash] += state.hashes[hash];
}
}
// only return the hashes occurring at least twice
return [
'week': week,
'hashes': hashes.keySet().stream().filter(hash -> hashes[hash] >= 2)
.collect(Collectors.toList())
]
"""
}
}
}
}
}
Before running the transform, we need to create the set-id pipeline (referenced in the dest section of the transform) that will define the ID of the target document that is going to contain the hashes so that we can reference it in the terms query for updating documents:
PUT _ingest/pipeline/set-id
{
"processors": [
{
"set": {
"field": "_id",
"value": "{{dups.week}}"
}
}
]
}
We're now ready to start the transform to generate the list of hashes to update and it's as simple as running this:
POST _transform/dup-transform/_start
When it has run, the destination index test-dups will contain one document that looks like this:
{
"_index" : "test-dups",
"_type" : "_doc",
"_id" : "44",
"_score" : 1.0,
"_source" : {
"week" : "2021-11-01T00:00:00.000Z",
"dups" : {
"week" : "44",
"hashes" : [
"12345"
]
}
}
},
Finally, we can run the update by query as follows (add as many terms queries as weekly documents in the target index):
POST test/_update_by_query
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{
"terms": {
"identificationHash": {
"index": "test-dups",
"id": "44",
"path": "dups.hashes"
}
}
},
{
"terms": {
"identificationHash": {
"index": "test-dups",
"id": "45",
"path": "dups.hashes"
}
}
}
]
}
},
"script": {
"source": "ctx._source.isDupe = true;"
}
}
That's it in two simple steps!! Try it out and let me know.

How to get the latest value for the bucket in Elasticsearch?

I have a bunch of documents with just count field.
I'm trying to get the latest value for that field aggregated by date:
{
"query": {
"match_all": {}
},
"sort": "_timestamp",
"aggs": {
"result": {
"date_histogram": {
"field": "_timestamp",
"interval": "day",
"min_doc_count": 0
},
"aggs": {
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"last_value": 0
}
},
"map_script": "_agg.last_value = doc['count'].value",
"reduce_script": "return _aggs.last().last_value"
}
}
}
}
}
}
But the problem here is that documents fall into last_value aggregation not sorted by _timestamp, so I can't guarantee that the last value is really the last value.
So, my questions:
Is it possible to sort data by _timestamp when performing last_value aggregation?
Is there any better way to get the last value aggregated by day?
Looks like it is possible to tune scripted_metric aggregations a little bit to solve the first part of the question (sorting by _timestamp):
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"value": 0,
"timestamp": 0
}
},
"map_script": "_agg.value = doc['count'].value; _agg.timestamp = doc['_timestamp'].value",
"reduce_script": "value = 0; timestamp=0; for (a in _aggs) { if(a.timestamp > timestamp){ value = a.value; timestamp = a.timestamp} }; return value;"
}
}
But I continue to doubt that this is the best way to solve that

How to subtract aggregate min from aggreagate max(difference) in ES?

How to write an ES query to find the difference between max and min value of a field?
I am a newbee in elastic search,
In my case I feed lot of events along with session_id and time in to elastic search.
My event structure is
Event_name string `json:"Event_name"`
Client_id string `json:"Client_id"`
App_id string `json:"App_id"`
Session_id string `json:"Session_id"`
User_id string `json:"User_id"`
Ip_address string `json:"Ip_address"`
Latitude int64 `json:"Latitude"`
Longitude int64 `json:"Longitude"`
Event_time time.Time `json:"Time"`
I want to find the life time of a session_id based the feeded events.
For that I can retrive the maximum Event_time and minimum Event_time for a particular session_id by the following ES query.
{
"size": 0,
"query": {
"match": {
"Session_id": "dummySessionId"
}
},
"aggs": {
"max_time": {
"max": {
"field": "Time"
}
},
"min_time":{
"min": {
"field": "Time"
}
}
}
}
But what I exact want is (max_time - min_time)
How to write the ES query for the same????
Up to elasticsearch 1.1.1, this is not possible to do any arithmetic operation upon two aggregate function's reasult from elasticsearch side.
If you want then, you should do that from client side.
That is neither possible through scripts, as #eliasah suggests.
In the upcoming versions they may be add such facility.
in the 1.5.1 using the Scripted Metric Aggregation you can do this. Not sure about the performance, but it looks to work. This functionality is experimental and may be changed or removed completely in a future release.
POST test_time
POST test_time/data/1
{"Session_id":1234,"Event_time":"2014-01-01T12:00:00"}
POST test_time/data/3
{"Session_id":1234,"Event_time":"2014-01-01T14:00:00"}
GET /test_time/_search
{
"size": 0,
"aggs": {
"by_user": {
"terms": {
"field": "Session_id"
},
"aggs": {
"session_lenght_sec": {
"scripted_metric": {
"map_script": "_agg['v'] = doc['Event_time'].value",
"reduce_script": "min = null; max = null; for (a in _aggs) {if (min == null || a.v < min) { min = a.v}; if (max == null || a.v > max) { max = a.v }}; return (max-min)/1000"
}
}
}
}
}
}
###### RESPONSE #######
{
...,
"aggregations": {
"by_user": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1234,
"doc_count": 2,
"session_lenght_sec": {
"value": "7200"
}
}
]
}
}
}
This answer is bound to the Elasticsearch 7.8 version.
Followed up the #pippobaudos answer ahead. Elasticsearch has made some major changes since the answer.
the aggregation has a type 'scripted_metric' (click on the link to know more), which has new sub-attributes such as init_script, map_script, combine_script, reduce_script. Out of which, only init_script is optional. Following is the modified query.
"aggs": {
"cumulative":{
"scripted_metric": {
"init_script": {
"source": "state.stars = []"
},
"map_script": {
"source": "if (doc.containsKey('star_count')) { state.stars.add(doc['star_count'].value); }"
},
"combine_script": {
"source": "long min=9223372036854775807L,max=-9223372036854775808L; for (a in state.stars) {if ( a < min) { min = a;} if ( a > max) { max = a; }} return (max-min)"
},
"reduce_script": {
"source": "long max = -9223372036854775808L; for (a in states) { if (a != null && a > max){ max=a; } } return max "
}
}
}
}
Giving directly the query will not help you much, so I suggest you read the documentation about Script Fields and Scripting.

Resources