Elasticsearch sort within sampler - elasticsearch

We are using Elasticsearch 7.*, and I'm trying to take a sample. It returns far more than 10,000 results, which is the max hits a query can return. In order to paginate with search_after, I need to sort the items by #timestamp (_id sorting will be deprecated soon).
Here's my current query:
GET /my-index-pattern/_search
{
"query": {
"range": {
"#timestamp": {
"gte": "now-1M",
"lte": "now"
}
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 40000
},
"aggs": {
"group_by_my_grouping_field": {
"terms": {
"field": "my_grouping_field.keyword",
"size": 10000
}
}
}
}
},
"sort": [
"#timestamp"
]
}
Returning:
"_shards" : {
"total" : 55,
"successful" : 55,
"skipped" : 43,
"failed" : 0
},
However, this takes a long time. I think it's sorting before doing the sample, which also affects my methodology. It's also skipping something?
Is there a way to sort within the sample?
I tried:
...
"sample": {
"sampler": {
"shard_size": 40000
},
"aggs": {
"group_by_my_grouping_field": {
"terms": {
"field": "my_grouping_field.keyword",
"size": 10000
}
},
"search_after_sort":
{
"bucket_sort": {
"sort": ["#timestamp"]
}
}
}
}
...
But this just gives:
"error" : {
"root_cause" : [
{
"type" : "action_request_validation_exception",
"reason" : "Validation Failed: 1: No aggregation found for path [#timestamp];"
}
],
"type" : "action_request_validation_exception",
"reason" : "Validation Failed: 1: No aggregation found for path [#timestamp];"
},
"status" : 400
This happens for all fields, like message and _id, not just on #timestamp.
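For the pagination part of the question, here is a minimal sketch of how consecutive search_after requests chain together. It only builds request bodies and reads the "sort" values that a sorted search returns on each hit; the helper names build_page_body and next_search_after and the sample response are hypothetical, and no client library is assumed. Sorting on the timestamp alone can skip documents that share a timestamp, so a unique tiebreaker field is usually added to the sort as well.

```python
from typing import Any, Dict, List, Optional


def build_page_body(query: Dict[str, Any], sort: List[Any], size: int,
                    search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build one search request body.

    Pass the sort values of the previous page's last hit to continue from there.
    """
    body: Dict[str, Any] = {"query": query, "sort": sort, "size": size}
    if search_after is not None:
        body["search_after"] = search_after
    return body


def next_search_after(response: Dict[str, Any]) -> Optional[List[Any]]:
    """Return the sort values of the last hit, or None when the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None


# Hypothetical response with two hits, shaped like a sorted search response.
fake_response = {"hits": {"hits": [
    {"_id": "a", "sort": [1580169600000]},
    {"_id": "b", "sort": [1580169660000]},
]}}

# Same range query and sort field as in the question.
query = {"range": {"#timestamp": {"gte": "now-1M", "lte": "now"}}}
first = build_page_body(query, ["#timestamp"], size=1000)
second = build_page_body(query, ["#timestamp"], size=1000,
                         search_after=next_search_after(fake_response))
```

The loop around these helpers would stop as soon as next_search_after returns None.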

Related

Elasticsearch, composite and sub(?) aggregations

I'm using composite to scroll through whole data. (it's like pagination)
Suppose a car selling data,
For each day, I'd like to count the number of cars sold per car-brand
{
day1: {
honda: 3,
bmw: 5
},
day2: {
honda: 4,
audi: 1,
tesla:5
}
}
I'm doing something like the following but it doesn't work
GET _search
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"date": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "1d"
},
"aggs": {
"car_brand": {
"terms": {
"field": "car_brands"
}
}
}
}
}
]
}
}
}
}
with error message
{
"error" : {
"root_cause" : [
{
"type" : "x_content_parse_exception",
"reason" : "[14:17] [composite] failed to parse field [sources]"
}
],
"type" : "x_content_parse_exception",
"reason" : "[14:17] [composite] failed to parse field [sources]",
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "expected value but got [FIELD_NAME]"
}
},
"status" : 400
}
A composite source cannot contain sub-aggs; each entry in sources takes a single value source. Go with
GET _search
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"date": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "1d"
}
}
},
{
"car_brand": {
"terms": {
"field": "car_brands"
}
}
}
]
}
}
}
}
instead.
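With two sources, each composite bucket key carries both the date and the brand, so the nested per-day shape from the question has to be rebuilt on the client. A minimal sketch, assuming the response shape of the query above; the helper names and the sample response are made up for illustration:

```python
from collections import defaultdict
from typing import Any, Dict, Optional


def pivot_composite(response: Dict[str, Any],
                    agg_name: str = "my_buckets") -> Dict[Any, Dict[str, int]]:
    """Turn flat (date, car_brand) composite buckets into {date: {brand: count}}."""
    nested: Dict[Any, Dict[str, int]] = defaultdict(dict)
    for bucket in response["aggregations"][agg_name]["buckets"]:
        key = bucket["key"]  # one entry per source name: "date" and "car_brand"
        nested[key["date"]][key["car_brand"]] = bucket["doc_count"]
    return dict(nested)


def after_key(response: Dict[str, Any],
              agg_name: str = "my_buckets") -> Optional[Dict[str, Any]]:
    """Return the key to send as "after" in the next request, if one is present."""
    return response["aggregations"][agg_name].get("after_key")


# Hypothetical response; dates are epoch milliseconds because no format is set.
sample_response = {"aggregations": {"my_buckets": {
    "after_key": {"date": 1589414400000, "car_brand": "honda"},
    "buckets": [
        {"key": {"date": 1589328000000, "car_brand": "bmw"}, "doc_count": 5},
        {"key": {"date": 1589328000000, "car_brand": "honda"}, "doc_count": 3},
        {"key": {"date": 1589414400000, "car_brand": "honda"}, "doc_count": 4},
    ],
}}}

per_day = pivot_composite(sample_response)
```

To fetch the next page, the value returned by after_key goes into the composite block as "after", next to "sources".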

elasticsearch return hits found in aggregation

I am trying to get rows from my database that have a unique 'sku' field.
I have a working query which counts this number correctly, my query:
GET _search
{
"size": 0,
"aggs": {
"unique_products":{
"cardinality":{
"field":"sku.keyword"
}
}
},
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "(merch1: 'Dog') AND ((store_name: 'walmart')) AND product_gap: 'yes'"
}
},
{
"range": {
"capture_date": {
"format": "date",
"gte": "2020-05-13",
"lte": "2020-08-03"
}
}
}
]
}
}
}
Returns this result:
{
"took" : 129,
"timed_out" : false,
"_shards" : {
"total" : 514,
"successful" : 514,
"skipped" : 98,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 150,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"unique_products" : {
"value" : 38
}
}
}
Which correctly reports the number of unique_products as 38.
I am trying to edit this query so that it will actually return all 38 unique products, but am unsure how, I started by trying to return the top hit from the agg result:
GET _search
{
"size": 0,
"aggs": {
"unique_products":{
"cardinality":{
"field":"sku.keyword"
}
},
"top_hits": {
"size": 1,
"_source": {
"include": [
"sku", "source_store"
]
}
}
},
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "(merch1: 'Dog') AND ((store_name: 'walmart')) AND product_gap: 'yes'"
}
},
{
"range": {
"capture_date": {
"format": "date",
"gte": "2020-05-13",
"lte": "2020-08-03"
}
}
}
]
}
}
}
But got an error in my result saying:
{
"error": {
"root_cause": [
{
"type": "parsing_exception",
"reason": "Expected [START_OBJECT] under [size], but got a [VALUE_NUMBER] in [top_hits]",
"line": 10,
"col": 13
}
],
"type": "parsing_exception",
"reason": "Expected [START_OBJECT] under [size], but got a [VALUE_NUMBER] in [top_hits]",
"line": 10,
"col": 13
},
"status": 400
}
Is a cardinality agg still my best bet for returning all 38 unique products? thanks
While the cardinality aggregation gives the unique count, it cannot accept sub-aggs. In other words top_hits cannot be used here directly.
The approach was correct but you may first want to bucketize the skus and then retrieve the underlying docs using top_hits:
{
"size": 0,
"aggs": {
"unique_products": {
"cardinality": {
"field": "sku.keyword"
}
},
"terms_agg": {
"terms": {
"field": "sku.keyword",
"size": 100
},
"aggs": {
"top_hits_agg": {
"top_hits": {
"size": 1,
"_source": {
"include": [
"sku",
"source_store"
]
}
}
}
}
}
},
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "(merch1: 'Dog') AND ((store_name: 'walmart')) AND product_gap: 'yes'"
}
},
{
"range": {
"capture_date": {
"format": "date",
"gte": "2020-05-13",
"lte": "2020-08-03"
}
}
}
]
}
}
}
FYI The reason your query threw an exception is that top_hits is an agg type and, just like unique_products, it was missing its own name.
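Once the terms and top_hits query above returns, the one document per sku sits a few levels deep in the response. A small sketch of pulling those documents out, assuming the aggregation names terms_agg and top_hits_agg from the answer; the function name and the sample response are illustrative only:

```python
from typing import Any, Dict, List


def unique_products(response: Dict[str, Any],
                    terms_name: str = "terms_agg",
                    hits_name: str = "top_hits_agg") -> List[Dict[str, Any]]:
    """Collect the _source of the top hit from every sku bucket."""
    docs: List[Dict[str, Any]] = []
    for bucket in response["aggregations"][terms_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # top_hits was asked for size 1, so take the first hit
            docs.append(hits[0]["_source"])
    return docs


# Hypothetical response with two sku buckets.
sample_response = {"aggregations": {
    "unique_products": {"value": 2},
    "terms_agg": {"buckets": [
        {"key": "sku-1", "doc_count": 4, "top_hits_agg": {"hits": {"hits": [
            {"_source": {"sku": "sku-1", "source_store": "walmart"}}]}}},
        {"key": "sku-2", "doc_count": 1, "top_hits_agg": {"hits": {"hits": [
            {"_source": {"sku": "sku-2", "source_store": "walmart"}}]}}},
    ]},
}}

products = unique_products(sample_response)
```

Note that the terms size (100 in the answer) must be at least the cardinality count, otherwise some skus are left out of the buckets.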

Elastic script from buckets and higher level aggregation

I want to compare the daily average of a metric (the frequency of words appearing in texts) to the value of a specific day. This is during a week. My goal is to check whether there's a spike. If the last day is way higher than the daily average, I'd trigger an alarm.
So from my input in Elasticsearch I compute the daily average during the week and find out the value for the last day of that week.
For getting the daily average for the week, I simply cut a week's worth of data using a range query on date field, so all my available data is the given week. I compute the sum and divide by 7 for a daily average.
For getting the last day's value, I did a terms aggregation on the date field with descending order and size 1 as suggested in a different question (How to select the last bucket in a date_histogram selector in Elasticsearch)
The whole output is as follows. Here you can see words "rama0" and "rama1" with their corresponding frequencies.
{
"aggregations" : {
"the_keywords" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "rama0",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
{
"key" : "rama1",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
[...]
]
}
}
}
Now I have the_daily_average in a high level of the output, and the_last_day_frequency in the single-element buckets list in the_last_day aggregation. I cannot use a bucket_script to compare those, because I cannot refer to a single bucket (if I place the script outside the_last_day aggregation) and I cannot refer to higher-level aggregations if I place the script inside the_last_day.
IMO the reasonable thing to do would be to put the script outside the aggregation and use a buckets_path using the <AGG_NAME><MULTIBUCKET_KEY> syntax mentioned in the docs, but I have tried "var1": "the_last_day[1580169600000]>the_last_day_frequency" and variations (hardcoding first until it works), but I haven't been able to refer to a particular bucket.
My ultimate goal is to have a list of keywords for which the last day frequency greatly exceeds the daily average.
For anyone interested, my current query is as follows. Notice that the part I'm struggling with is commented out.
body='{
"query": {
"range": {
"date": {
"gte": "START",
"lte": "END"
}
}
},
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average" : {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {"_key": "desc"}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
}/*,
"the_spike": {
"bucket_script": {
"buckets_path": {
"last_day_frequency": "the_last_day>the_last_day_frequency",
"daily_average": "the_daily_average"
},
"script": {
"inline": "return last_day_frequency / daily_average"
}
}
}*/
}
}
}
}'
In your query, the_last_day>the_last_day_frequency points to a bucket, not a single value, so it throws an error. You need a single metric value from the_last_day_frequency, which you can get with max_bucket. You can then use a bucket_selector aggregation to compare the last day's value with the average.
Query:
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average": {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {
"_key": "desc"
}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
},
"max_frequency_last_day": {
"max_bucket": {
"buckets_path": "the_last_day>the_last_day_frequency"
}
},
"the_spike": {
"bucket_selector": {
"buckets_path": {
"last_day_frequency": "max_frequency_last_day",
"daily_average": "the_daily_average"
},
"script": {
"inline": "params.last_day_frequency > params.daily_average"
}
}
}
}
}
}
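As a cross-check, the same comparison can be sketched on the client, reading the response shape shown in the question (the_last_day holding a single bucket, the_daily_average next to it). The function name, the factor parameter and the sample values for "rama2" are assumptions added for illustration:

```python
from typing import Any, Dict, List


def spiking_keywords(response: Dict[str, Any], factor: float = 1.0) -> List[str]:
    """Return keywords whose last-day frequency exceeds factor * daily average."""
    spikes: List[str] = []
    for bucket in response["aggregations"]["the_keywords"]["buckets"]:
        last_day_buckets = bucket["the_last_day"]["buckets"]
        if not last_day_buckets:
            continue  # keyword has no data at all in the range
        last_day = last_day_buckets[0]["the_last_day_frequency"]["value"]
        average = bucket["the_daily_average"]["value"]
        if average is not None and last_day > factor * average:
            spikes.append(bucket["key"])
    return spikes


# "rama0" uses the values from the question; "rama2" is a made-up spike.
sample_response = {"aggregations": {"the_keywords": {"buckets": [
    {"key": "rama0",
     "the_last_day": {"buckets": [{"the_last_day_frequency": {"value": 3000.0}}]},
     "the_daily_average": {"value": 3000.0}},
    {"key": "rama2",
     "the_last_day": {"buckets": [{"the_last_day_frequency": {"value": 9000.0}}]},
     "the_daily_average": {"value": 3000.0}},
]}}}

spikes = spiking_keywords(sample_response, factor=2.0)
```

With factor=1.0 this matches the strict "greater than" test in the bucket_selector script; a larger factor expresses "greatly exceeds".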

How to aggregate until a certain value is reached in ElasticSearch?

I would like to aggregate a list of documents (each has two fields, timestamp and amount) by the "amount" field until a certain value is reached. For example, I would like to get the list of documents, sorted by timestamp, whose total amount equals 100. Is it possible to do this in one query?
Here is my query which returns total amount - I would like to add here a condition to stop aggregation when a certain value is reached.
{
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": 1525168583
}
}
}
]
}
},
"aggs": {
"total_amount": {
"sum": {
"field": "amount"
}
}
},
"sort": [
"timestamp"
],
"size": 10000
}
Thank You
It's perfectly possible using a combination of function_score scripting for mimicking sorting, filter aggs for the range gte query and a healthy amount of scripted_metric aggs to limit the summation up to a certain amount.
Let's first set up a mapping and ingest some docs:
PUT summation
{
"mappings": {
"properties": {
"timestamp": {
"type": "date",
"format": "epoch_second"
}
}
}
}
POST summation/_doc
{
"context": "newest",
"timestamp": 1587049128,
"amount": 20
}
POST summation/_doc
{
"context": "2nd newest",
"timestamp": 1586049128,
"amount": 30
}
POST summation/_doc
{
"context": "3rd newest",
"timestamp": 1585049128,
"amount": 40
}
POST summation/_doc
{
"context": "4th newest",
"timestamp": 1585049128,
"amount": 30
}
Then perform the query:
GET summation/_search
{
"size": 0,
"aggs": {
"filtered_agg": {
"filter": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1585049128
}
}
},
{
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": "return (params['now'] - doc['timestamp'].date.toMillis())",
"params": {
"now": 1587049676
}
}
}
}
}
]
}
},
"aggs": {
"limited_sum": {
"scripted_metric": {
"init_script": """
state['my_hash'] = new HashMap();
state['my_hash'].put('sum', 0);
state['my_hash'].put('docs', new ArrayList());
""",
"map_script": """
if (state['my_hash']['sum'] <= 100) {
state['my_hash']['sum'] += doc['amount'].value;
state['my_hash']['docs'].add(doc['context.keyword'].value);
}
""",
"combine_script": "return state['my_hash']",
"reduce_script": "return states[0]"
}
}
}
}
}
}
yielding
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"filtered_agg" : {
"meta" : { },
"doc_count" : 4,
"limited_sum" : {
"value" : {
"docs" : [
"newest",
"2nd newest",
"3rd newest",
"4th newest"
],
"sum" : 120
}
}
}
}
}
I've chosen here to only return the doc.contexts but you can adjust it to retrieve whatever you like -- be it IDs, amounts etc.
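For comparison, the accumulation rule used in map_script can be sketched on the client over documents that were already fetched and sorted. This mirrors the script's check (keep adding while the running sum is still at or below the limit, so the final sum can overshoot, as the 120 in the output above shows). The function name take_until and the in-memory document list are assumptions, not part of the original answer:

```python
from typing import Any, Dict, List


def take_until(docs: List[Dict[str, Any]], limit: float) -> Dict[str, Any]:
    """Add documents in order while the running sum is still <= limit."""
    running = 0
    taken: List[str] = []
    for doc in docs:
        if running > limit:
            break  # the limit was crossed by the previous document
        running += doc["amount"]
        taken.append(doc["context"])
    return {"docs": taken, "sum": running}


# The four documents ingested above, plus one hypothetical older document.
docs = [
    {"context": "newest", "timestamp": 1587049128, "amount": 20},
    {"context": "2nd newest", "timestamp": 1586049128, "amount": 30},
    {"context": "3rd newest", "timestamp": 1585049128, "amount": 40},
    {"context": "4th newest", "timestamp": 1585049128, "amount": 30},
    {"context": "hypothetical older", "timestamp": 1584049128, "amount": 10},
]

# Newest first; Python's sort is stable, so equal timestamps keep their order.
newest_first = sorted(docs, key=lambda d: d["timestamp"], reverse=True)
result = take_until(newest_first, limit=100)
```

To stop before the limit is exceeded instead, the check would have to look at running + doc["amount"] before adding.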

Elasticsearch sort inside top_hits aggregation

I have an index of messages where I store messageHash for each message too. I also have many more fields along with them. There are multiple duplicate message fields in the index e.g. "Hello". I want to retrieve unique messages.
Here is the query I wrote to search for unique messages and sort them by date. Among all duplicates, I want the message with the latest date to be returned.
{
"query": {
"bool": {
"must": {
"match_phrase": {
"message": "Hello"
}
}
}
},
"sort": [
{
"date": {
"order": "desc"
}
}
],
"aggs": {
"top_messages": {
"terms": {
"field": "messageHash"
},
"aggs": {
"top_messages_hits": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
},
"_score"
],
"size": 1
}
}
}
}
}
}
The problem is that it's not sorted by date. It's sorted by doc_count. I just get the sort values in the response, not the real sorted results. What's wrong? I'm now wondering if it is even possible to do it.
EDIT:
I tried substituting "terms" : { "field" : "messageHash", "order" : { "mydate" : "desc" } } , "aggs" : { "mydate" : { "max" : { "field" : "date" } } } for "terms": { "field": "messageHash" } but I get:
{
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "Found two sub aggregation definitions under [top_messages]",
"line" : 1,
"col" : 412
}
],
"type" : "parsing_exception",
"reason" : "Found two sub aggregation definitions under [top_messages]",
"line" : 1,
"col" : 412
},
"status" : 400
}
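The parsing error in the EDIT is consistent with top_messages ending up with two separate "aggs" objects after the substitution. A possible sketch, not a confirmed fix for this index, is to keep a single "aggs" object that holds both the max sub-aggregation used for ordering and the top_hits. The request body is built here as a Python dict; the function name and the size parameter are assumptions:

```python
import json
from typing import Any, Dict


def unique_latest_messages_body(phrase: str, size: int = 10) -> Dict[str, Any]:
    """Build a request that orders messageHash buckets by their latest date."""
    return {
        "size": 0,
        "query": {"bool": {"must": {"match_phrase": {"message": phrase}}}},
        "aggs": {
            "top_messages": {
                # Order buckets by the "mydate" sub-aggregation defined below.
                "terms": {
                    "field": "messageHash",
                    "size": size,
                    "order": {"mydate": "desc"},
                },
                # One "aggs" object holding both sub-aggregations.
                "aggs": {
                    "mydate": {"max": {"field": "date"}},
                    "top_messages_hits": {
                        "top_hits": {
                            "sort": [{"date": {"order": "desc"}}],
                            "size": 1,
                        }
                    },
                },
            }
        },
    }


body = unique_latest_messages_body("Hello")
request_json = json.dumps(body, indent=2)  # text to paste into a console
```

The top-level "sort" from the original query is left out because it only orders hits, not aggregation buckets.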
