How to order by inner aggregate of date_histogram aggregation?

Thanks for checking the question.
My query is as below:
"size":0,
"aggs": {
"result" : {
"date_histogram": {
"field": "time",
"calendar_interval": "day"
},
"aggs": {
"user": {
"terms": {
"field": "user.number"
},
"aggs" : {
"privacy_types": {
"nested": {
"path": "list"
},
"aggs": {
"totalCnt": {
"sum": {
"field": "list.count"
}
}
}
}
}
}
}
}
}
This is my result (screenshot omitted).
I want to group by date and user.number and sort by totalCnt.
My query is not getting the desired result.
How can I get it to work properly?
I've been struggling for 3 days, please help :(
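For context, the usual way to sort terms buckets by a metric buried inside a sub-aggregation is the terms order parameter with an aggregation path. A sketch against the query above (untested, assuming the mapping shown):

```json
{
  "size": 0,
  "aggs": {
    "result": {
      "date_histogram": {
        "field": "time",
        "calendar_interval": "day"
      },
      "aggs": {
        "user": {
          "terms": {
            "field": "user.number",
            "order": { "privacy_types>totalCnt": "desc" }
          },
          "aggs": {
            "privacy_types": {
              "nested": { "path": "list" },
              "aggs": {
                "totalCnt": {
                  "sum": { "field": "list.count" }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

With this, the user buckets inside each day bucket would come back sorted by totalCnt; the date_histogram buckets themselves stay in time order.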

Related

Elasticsearch: How to set 'doc_count' of a FILTER-Aggregation in relation to total 'doc_count'

A seemingly very trivial problem prompted me today to read the Elasticsearch documentation again diligently. So far, however, I have not come across the solution.
Question: is there a simple way to set the doc_count of a filter aggregation in relation to the total doc_count?
Here's a snippet from my search request JSON. In the feature_occurrences aggregation I filter documents.
Now I want to calculate the ratio of filtered to all docs in each time bucket.
GET my_index/_search
{
  "aggs": {
    "time_buckets": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1d",
        "min_doc_count": 0
      },
      "aggs": {
        "feature_occurrences": {
          "filter": {
            "term": {
              "x": "y"
            }
          }
        },
        "feature_occurrences_per_doc": {
          // feature_occurrences.doc_count / doc_count
        }
      }
    }
  }
}
Any Ideas ?
You can use bucket_script to calc the ratio:
{
  "aggs": {
    "date": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "hour"
      },
      "aggs": {
        "feature_occurrences": {
          "filter": {
            "term": {
              "cloud.region": "westeurope"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "doc_count": "_count",
              "features_count": "feature_occurrences._count"
            },
            "script": "params.features_count / params.doc_count"
          }
        }
      }
    }
  }
}
Elastic bucket script doc:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html
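Along the same lines, the placeholder in the original time_buckets snippet could be filled in like this (a sketch only; the buckets_path names are illustrative):

```json
"feature_occurrences_per_doc": {
  "bucket_script": {
    "buckets_path": {
      "features_count": "feature_occurrences._count",
      "doc_count": "_count"
    },
    "script": "params.features_count / params.doc_count"
  }
}
```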

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "type": ["manager", "lead"]
          }
        }
      ]
    }
  }
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.
I suggest using the bucket_script pipeline aggregation. As far as I know, this aggregation needs to run as a sub-aggregation of a histogram aggregation. Since you have such a field in your mapping, I think this query should work for you:
{
  "size": 0,
  "aggs": {
    "NAME": {
      "date_histogram": {
        "field": "my_datetime",
        "interval": "month"
      },
      "aggs": {
        "role_type": {
          "terms": {
            "field": "type",
            "size": 10
          },
          "aggs": {
            "count": {
              "value_count": {
                "field": "_id"
              }
            }
          }
        },
        "role_1_ratio": {
          "bucket_script": {
            "buckets_path": {
              "role_1": "role_type['manager']>count",
              "role_2": "role_type['lead']>count"
            },
            "script": "params.role_1 / (params.role_1 + params.role_2) * 100"
          }
        },
        "role_2_ratio": {
          "bucket_script": {
            "buckets_path": {
              "role_1": "role_type['manager']>count",
              "role_2": "role_type['lead']>count"
            },
            "script": "params.role_2 / (params.role_1 + params.role_2) * 100"
          }
        }
      }
    }
  }
}
Please let me know if it didn't work well for you.

Serial Differencing Aggregation

This is my first attempt at Elasticsearch. I am trying to sort the data by timestamp, and in the result I want the difference between each pair of consecutive timestamps.
I am trying to achieve this through the Serial Differencing Aggregation, where I expect an output field with the difference, but I do not get any such field.
So the question is: is this approach correct, or is there something else that needs to be done?
"aggs": {
"users": {
"terms": {
"field": "ts"
},
"aggs": {
"my_date_histo": {
"date_histogram": {
"field": "ts",
"interval": "second"
},
"aggs": {
"the_sum": {
"sum": {
"field": "lemmings"
}
},
"thirtieth_difference": {
"serial_diff": {
"buckets_path": "the_sum",
"lag": 1
}
}
}
}
}
}
}
Here I do not have a field named "lemmings", but Elasticsearch does not complain.

For each country/colour/brand combination, find sum of number of items in elasticsearch

This is a portion of the data I have indexed in elasticsearch:
{
  "country" : "India",
  "colour" : "white",
  "brand" : "sony",
  "numberOfItems" : 3
}
I want to get the total sum of numberOfItems on a per country basis, per colour basis and per brand basis. Is there any way to do this in elasticsearch?
The following should land you straight to the answer.
Make sure you enable scripting before using it.
{
  "aggs": {
    "keys": {
      "terms": {
        "script": "doc['country'].value + doc['colour'].value + doc['brand'].value"
      },
      "aggs": {
        "keySum": {
          "sum": {
            "field": "numberOfItems"
          }
        }
      }
    }
  }
}
To get a single result you may use sum aggregation applied to a filtered query with term (terms) filter, e.g.:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "country": "India"
        }
      }
    }
  },
  "aggs": {
    "total_sum": {
      "sum": {
        "field": "numberOfItems"
      }
    }
  }
}
To get statistics for all countries/colours/brands in a single pass over the data you may use the following query with 3 multi-bucket aggregations, each of them containing a single-bucket sum sub-aggregation:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "countries": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "country_sum": {
          "sum": {
            "field": "numberOfItems"
          }
        }
      }
    },
    "colours": {
      "terms": {
        "field": "colour"
      },
      "aggs": {
        "colour_sum": {
          "sum": {
            "field": "numberOfItems"
          }
        }
      }
    },
    "brands": {
      "terms": {
        "field": "brand"
      },
      "aggs": {
        "brand_sum": {
          "sum": {
            "field": "numberOfItems"
          }
        }
      }
    }
  }
}

elasticsearch getting too many results, need help filtering query

I'm having a lot of trouble understanding the underlying ES querying system.
I've got the following query, for example:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "referer": "www.xx.yy.com"
          }
        },
        {
          "range": {
            "#timestamp": {
              "gte": "now",
              "lt": "now-1h"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}
That request gets too many results:
"status" : 500, "reason" :
"ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException:
Data too large, data for field [#timestamp] would be larger than limit
of [3200306380/2.9gb]]; nested:
UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException:
Data too large, data for field [#timestamp] would be larger than limit
of [3200306380/2.9gb]]; nested: CircuitBreakingException[Data too
large, data for field [#timestamp] would be larger than limit of
[3200306380/2.9gb]]; "
I've tried that request:
{
  "size": 0,
  "filter": {
    "and": [
      {
        "term": {
          "referer": "www.geoportail.gouv.fr"
        }
      },
      {
        "range": {
          "#timestamp": {
            "from": "2014-10-04",
            "to": "2014-10-05"
          }
        }
      }
    ]
  },
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}
I would like to filter the data in order to get a correct result; any help would be much appreciated!
I found a solution; it's kind of weird.
I followed dimzak's advice and cleared the cache:
curl --noproxy localhost -XPOST "http://localhost:9200/_cache/clear"
Then I used filtering instead of querying, as Olly suggested:
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "term": {
          "referer": "www.xx.yy.fr"
        }
      },
      "filter": {
        "range": {
          "#timestamp": {
            "from": "2014-10-04T00:00",
            "to": "2014-10-05T00:00"
          }
        }
      }
    }
  },
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}
I cannot give the answer to both of you; I think dimzak deserves it most, but thumbs up to you two guys :)
You can try clearing the cache first and then executing the above query, as shown here.
Another solution may be to remove the interval or reduce the time range in your query...
My best bet would be to either clear the cache first or allocate more memory to Elasticsearch (more here).
Using a filter would improve performance (note the range bounds: the lower bound must be the earlier time, so "now-1h" goes in "gte"):
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "term": {
          "referer": "www.xx.yy.com"
        }
      },
      "filter": {
        "range": {
          "#timestamp": {
            "gte": "now-1h",
            "lt": "now"
          }
        }
      }
    }
  },
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}
You may also find that a date range is better than a date histogram - you need to define the buckets yourself.
Is the referer field being analysed, or do you want an exact match on it? If so, set it to not_analyzed.
Is there much cardinality in your host field? Have you tried pre-hashing the values?
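To illustrate the date_range suggestion (a sketch only; the bucket boundaries here are made up for the example), the date_histogram could be swapped for explicitly defined buckets:

```json
"aggs": {
  "interval": {
    "date_range": {
      "field": "#timestamp",
      "ranges": [
        { "from": "2014-10-04T00:00", "to": "2014-10-04T12:00" },
        { "from": "2014-10-04T12:00", "to": "2014-10-05T00:00" }
      ]
    },
    "aggs": {
      "what": {
        "cardinality": {
          "field": "host"
        }
      }
    }
  }
}
```

Because you name each range yourself, only the buckets you ask for are built, which can be cheaper than a histogram over a large time span.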
