How to display only the key from the bucket - elasticsearch

I have an index with millions of documents. Suppose each of my documents has some code, and I need to find the list of codes matching some criteria. The only way I found to do that is using a whole lot of aggregations, so I created an ugly query which does exactly what I want:
POST my-index/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2017-08-01T00:00:00.000",
        "lt": "2017-08-08T00:00:00.000"
      }
    }
  },
  "size": 0,
  "aggs": {
    "codes": {
      "terms": {
        "field": "code",
        "size": 10000
      },
      "aggs": {
        "days": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "day",
            "format": "dd"
          },
          "aggs": {
            "hours": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "hour",
                "format": "yyyy-MM-dd:HH"
              },
              "aggs": {
                "hour_income": {
                  "sum": {
                    "field": "price"
                  }
                }
              }
            },
            "max_income": {
              "max_bucket": {
                "buckets_path": "hours>hour_income"
              }
            },
            "day_income": {
              "sum_bucket": {
                "buckets_path": "hours.hour_income"
              }
            },
            "more_than_sixty_percent": {
              "bucket_script": {
                "buckets_path": {
                  "dayIncome": "day_income",
                  "maxIncome": "max_income"
                },
                "script": "params.maxIncome - params.dayIncome * 60 / 100 > 0 ? 1 : 0"
              }
            }
          }
        },
        "amount_of_days": {
          "sum_bucket": {
            "buckets_path": "days.more_than_sixty_percent"
          }
        },
        "bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "amountOfDays": "amount_of_days"
            },
            "script": "params.amountOfDays >= 3"
          }
        }
      }
    }
  }
}
The response I get is a few million lines of JSON, consisting of buckets. Each bucket has more than 700 lines (and buckets of its own), but all I need is its key, so that I have my list of codes. I guess it's not good having a response a few thousand times larger than necessary, and there might be problems with parsing. So I wanted to ask: is there any way to hide the other info in the bucket and get only the keys?
Thanks.
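One option worth checking (my addition, not from the original post): Elasticsearch supports response filtering via the `filter_path` request parameter, which tells the server to strip everything except the fields you list before sending the response:

```
POST my-index/_search?filter_path=aggregations.codes.buckets.key
```

With the same request body as above, the response should then contain only the key of each codes bucket; the inner day/hour aggregations never reach the client at all.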

Related

Elasticsearch: How set 'doc_count' of a FILTER-Aggregation in relation to total 'doc_count'

A seemingly very trivial problem prompted me today to once again read the Elasticsearch documentation diligently. So far, however, I have not come across the solution.
Question:
Is there a simple way to set the doc_count of a filter aggregation in relation to the total doc_count?
Here's a snippet from my search-request-json.
In the feature_occurrences aggregation I filtered documents.
Now I want to calculate the ratio of filtered docs to all docs in each time bucket.
GET my_index/_search
{
  "aggs": {
    "time_buckets": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1d",
        "min_doc_count": 0
      },
      "aggs": {
        "feature_occurrences": {
          "filter": {
            "term": {
              "x": "y"
            }
          }
        },
        "feature_occurrences_per_doc": {
          // feature_occurrences.doc_count / doc_count
        }
      }
    }
  }
}
Any ideas?
You can use a bucket_script aggregation to calculate the ratio:
{
  "aggs": {
    "date": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "hour"
      },
      "aggs": {
        "feature_occurrences": {
          "filter": {
            "term": {
              "cloud.region": "westeurope"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "doc_count": "_count",
              "features_count": "feature_occurrences._count"
            },
            "script": "params.features_count / params.doc_count"
          }
        }
      }
    }
  }
}
Elastic bucket script doc:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html
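A caveat worth noting (my addition, not part of the original answer): the question's snippet uses `min_doc_count: 0`, so empty buckets will have a doc count of 0 and the ratio script would then divide by zero. A guard in the script avoids a meaningless result:

```
"ratio": {
  "bucket_script": {
    "buckets_path": {
      "doc_count": "_count",
      "features_count": "feature_occurrences._count"
    },
    "script": "params.doc_count == 0 ? 0 : params.features_count / params.doc_count"
  }
}
```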

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "type": ["manager", "lead"]
          }
        }
      ]
    }
  }
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.
I suggest using a bucket_script aggregation. As far as I know, this aggregation needs to run as a sub-aggregation of a histogram aggregation. Since you have such a field in your mapping, I think this query should work for you:
{
  "size": 0,
  "aggs": {
    "NAME": {
      "date_histogram": {
        "field": "my_datetime",
        "interval": "month"
      },
      "aggs": {
        "role_type": {
          "terms": {
            "field": "type",
            "size": 10
          },
          "aggs": {
            "count": {
              "value_count": {
                "field": "_id"
              }
            }
          }
        },
        "role_1_ratio": {
          "bucket_script": {
            "buckets_path": {
              "role_1": "role_type['manager']>count",
              "role_2": "role_type['lead']>count"
            },
            "script": "params.role_1 / (params.role_1 + params.role_2) * 100"
          }
        },
        "role_2_ratio": {
          "bucket_script": {
            "buckets_path": {
              "role_1": "role_type['manager']>count",
              "role_2": "role_type['lead']>count"
            },
            "script": "params.role_2 / (params.role_1 + params.role_2) * 100"
          }
        }
      }
    }
  }
}
Please let me know if it doesn't work well for you.

Percentile based filtering elastic search

I'm trying to calculate the 15th and 75th percentiles on an aggregated derived field (latency), and then retrieve the records whose field value is greater than (p75 - p15). I am able to calculate the aggregations and the threshold, but I'm unable to filter out the required values. I tried the query below and am running into "buckets_path must reference either a number value or a single value numeric metric aggregation, got: java.lang.Object[]". I'm just trying to retrieve records with average latency > threshold. Any pointers?
{
  "aggs": {
    "by_name": {
      "terms": {
        "script": "doc['name'].value + ',' + doc['valf'].value",
        "size": 5000
      },
      "aggs": {
        "single_round_block": {
          "date_histogram": {
            "field": "start_time",
            "interval": "300s"
          },
          "aggs": {
            "overallSumLatency": {
              "sum": {
                "field": "sum_latency_ms"
              }
            },
            "overallNumLatencyMeasurements": {
              "sum": {
                "field": "num_valid_latency_measurements"
              }
            },
            "avgLatency": {
              "bucket_script": {
                "buckets_path": {
                  "sumLatency": "overallSumLatency",
                  "numPoints": "overallNumLatencyMeasurements"
                },
                "script": "(params.numPoints == 0) ? 0 : (params.sumLatency / params.numPoints)"
              }
            }
          }
        },
        "percentiles_vals": {
          "percentiles_bucket": {
            "buckets_path": "single_round_block>avgLatency",
            "percents": [15.0, 75.0]
          }
        },
        "threshold": {
          "bucket_script": {
            "buckets_path": {
              "perc75": "percentiles_vals[75.0]",
              "perc15": "percentiles_vals[15.0]"
            },
            "script": "Math.abs(params.perc75 - params.perc15)"
          }
        },
        "filter_out_records": {
          "bucket_selector": {
            "buckets_path": {
              "threshold": "threshold",
              "avgLatency": "single_round_block>avgLatency"
            },
            "script": "params.avgLatency > params.threshold"
          }
        }
      }
    }
  }
}

ElasticSearch: Query syntax is painful

I have just started working with Elasticsearch, and it is painful to write in Painless: it is so difficult to see the connections between brackets, and there are too many spaces. I am working on outlier detection, and as an example, this is what the code looks like:
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "hour": {
            "gte": "{{start}}",
            "lte": "{{end}}"
          }
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "metrics": {
      "terms": {
        "field": "metric",
        "size": 5
      },
      "aggs": {
        "queries": {
          "terms": {
            "field": "query",
            "size": 500
          },
          "aggs": {
            "series": {
              "date_histogram": {
                "field": "hour",
                "interval": "hour"
              },
              "aggs": {
                "avg": {
                  "avg": {
                    "field": "value"
                  }
                },
                "movavg": {
                  "moving_avg": {
                    "buckets_path": "avg",
                    "window": 24,
                    "model": "simple"
                  }
                },
                "surprise": {
                  "bucket_script": {
                    "buckets_path": {
                      "avg": "avg",
                      "movavg": "movavg"
                    },
                    "script": "(avg - movavg).abs()"
                  }
                }
              }
            },
            "largest_surprise": {
              "max_bucket": {
                "buckets_path": "series.surprise"
              }
            }
          }
        },
        "ninetieth_surprise": {
          "percentiles_bucket": {
            "buckets_path": "queries>largest_surprise",
            "percents": [
              90
            ]
          }
        }
      }
    }
  }
}
I solve it by creating my own convention to keep the code readable. It is based only on the closing brackets, and the indentation helps readability: it opens a new line whenever a closing brackets' group ends (except inline ones like "{{start}}"). It is something like this:
{
"query":{"filtered":{"filter":{"range":{"hour":{"gte":"{{start}}","lte":"{{end}}"}}}}},
"size":0,
"aggs":{"metrics":{"terms":{"field":"metric","size":5},
"aggs":{"queries":{"terms":{"field":"query","size":500},
"aggs":{"series": {"date_histogram":{"field":"hour","interval":"hour"},
"aggs":{"avg":{"avg":{"field":"value"}},
....
I would love to know whether there is any other convention that helps with readability and following the lines of code. What is being used in the community?
CODE from: https://www.elastic.co/blog/implementing-a-statistical-anomaly-detector-part-1
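Not an established community convention, but one common workaround (my suggestion, not from the original post) is to keep queries compact and round-trip them through a JSON pretty-printer whenever you need to read them; Kibana's Dev Tools console can also re-indent requests for you. For example, with Python's standard json module:

```python
import json

# A compact one-liner query in the style shown above
compact = '{"aggs":{"metrics":{"terms":{"field":"metric","size":5}}}}'

# Round-trip through the json module to get a readable, indented form
pretty = json.dumps(json.loads(compact), indent=2)
print(pretty)
```

The reverse direction works too: `json.dumps(obj, separators=(",", ":"))` collapses an indented query back into a one-liner.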

How to use cumulative_sum with a previous aggregation?

I would like to plot a cumulative sum of some events, per day. The cumulative sum aggregation seems to be the way to go, so I tried to reuse the example given in the docs.
The first aggregation works fine; the following query
{
  "aggs": {
    "vulns_day": {
      "date_histogram": {
        "field": "HOST_START_iso",
        "interval": "day"
      }
    }
  }
}
gives replies such as
(...)
{
  "key_as_string": "2016-09-08T00:00:00.000Z",
  "key": 1473292800000,
  "doc_count": 76330
},
{
  "key_as_string": "2016-09-09T00:00:00.000Z",
  "key": 1473379200000,
  "doc_count": 37712
},
(...)
I then wanted to query the cumulative sum of doc_count above via
{
  "aggs": {
    "vulns_day": {
      "date_histogram": {
        "field": "HOST_START_iso",
        "interval": "day"
      }
    },
    "aggs": {
      "vulns_cumulated": {
        "cumulative_sum": {
          "buckets_path": "doc_count"
        }
      }
    }
  }
}
but it gives an error:
"reason": {
  "type": "search_parse_exception",
  "reason": "Could not find aggregator type [vulns_cumulated] in [aggs]",
I see that buckets_path should point to the elements to be summed, and the example for cumulative aggregations creates a specific intermediate sum, but I do not have anything to sum (besides doc_count).
I guess you should change your query like this:
{
  "aggs": {
    "vulns_day": {
      "date_histogram": {
        "field": "HOST_START_iso",
        "interval": "day"
      },
      "aggs": {
        "document_count": {
          "value_count": {
            "field": "HOST_START_iso"
          }
        },
        "vulns_cumulated": {
          "cumulative_sum": {
            "buckets_path": "document_count"
          }
        }
      }
    }
  }
}
I found the solution. Since doc_count did not seem to be available, I tried to retrieve stats for the time parameter, and use its count value. It worked:
{
  "size": 0,
  "aggs": {
    "vulns_day": {
      "date_histogram": {
        "field": "HOST_START_iso",
        "interval": "day"
      },
      "aggs": {
        "dates_stats": {
          "stats": {
            "field": "HOST_START_iso"
          }
        },
        "vulns_cumulated": {
          "cumulative_sum": {
            "buckets_path": "dates_stats.count"
          }
        }
      }
    }
  }
}
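A side note beyond the workaround above (my addition, not from the original thread): the buckets_path syntax also defines a special `_count` path referring to each bucket's document count, so the intermediate stats aggregation should not be needed:

```
"vulns_cumulated": {
  "cumulative_sum": {
    "buckets_path": "_count"
  }
}
```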
