Ratio with Elasticsearch

I have a list of customers with this structure:
{
"name" : "Toya Romano",
"hungry" : false,
"date" : 1420090500020
}
I would like to get the ratio of people who are hungry. How can I do it with an Elasticsearch query? I am running ES 2.3.

This is a rather hacky approach because of this issue, but it should work:
{
"size": 0,
"aggs": {
"whatever": {
"filters": {
"filters": [{}]
},
"aggs": {
"all_people": {
"filter": {}
},
"hungry_count": {
"filter": {
"term": {
"hungry": true
}
}
},
"hungry_ratio": {
"bucket_script": {
"buckets_path": {
"total_hungry": "hungry_count._count",
"all": "all_people._count"
},
"script": "total_hungry/all"
}
}
}
}
}
}
The result looks like this:
"buckets": [
{
"doc_count": 5,
"all_people": {
"doc_count": 5
},
"hungry_count": {
"doc_count": 3
},
"hungry_ratio": {
"value": 0.6
}
}
]
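The same ratio can also be computed on the client from the two filter counts, which sidesteps the bucket_script workaround. A minimal Python sketch, assuming the bucket has the shape shown above; the function name hungry_ratio is hypothetical, and the zero guard is an addition that the query itself does not have:

```python
def hungry_ratio(bucket):
    """Return hungry_count / all_people for one bucket of the response above."""
    total = bucket["all_people"]["doc_count"]
    hungry = bucket["hungry_count"]["doc_count"]
    # Guard against an empty index, where the division is undefined.
    if total == 0:
        return None
    return hungry / total


# Bucket copied from the sample result above.
sample_bucket = {
    "doc_count": 5,
    "all_people": {"doc_count": 5},
    "hungry_count": {"doc_count": 3},
    "hungry_ratio": {"value": 0.6},
}

print(hungry_ratio(sample_bucket))  # 0.6
```

With this approach the request only needs the two filter aggregations; the ratio itself never has to be computed inside Elasticsearch.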

Related

Elastic aggregation on specific values from within one field

I am migrating my db from postgres to elasticsearch. My postgres query looks like this:
select site_id, count(*) from r_2332 where site_id in ('1300','1364') and date >= '2021-01-25' and date <= '2021-01-30' group by site_id
The expected result is as follows:
site_id count
1300 1234
1364 2345
I am trying to derive the same result from elasticsearch aggs. I have tried the following:
GET /r_2332/_search
{
"query": {
"bool" : {
"should" : [
{"match" : {"site_id": "1300"}},
{"match" : {"site_id": "1364"}}
],"minimum_should_match": 1
}
},
"aggs" : {
"footfall" : {
"range" : {
"field" : "date",
"ranges" : [
{
"from":"2021-01-21",
"to":"2021-01-30"
}
]
}
}
}
}
This gives me the result as follows:
"aggregations":{"footfall":{"buckets":[{"key":"2021-01-21T00:00:00.000Z-2021-01-30T00:00:00.000Z","from":1.6111872E12,"from_as_string":"2021-01-21T00:00:00.000Z","to":1.6119648E12,"to_as_string":"2021-01-30T00:00:00.000Z","doc_count":2679}]}
and this:
GET /r_2332/_search
{
"query": {
"terms": {
"site_id": [ "1300", "1364" ],
"boost": 1.0
}
},
"aggs" : {
"footfall" : {
"range" : {
"field" : "date",
"ranges" : [
{
"from":"2021-01-21",
"to":"2021-01-30"
}
]
}
}
}
}
This provided the same result:
"aggregations":{"footfall":{"buckets":[{"key":"2021-01-21T00:00:00.000Z-2021-01-30T00:00:00.000Z","from":1.6111872E12,"from_as_string":"2021-01-21T00:00:00.000Z","to":1.6119648E12,"to_as_string":"2021-01-30T00:00:00.000Z","doc_count":2679}]}
How do I get the result separately for each site_id?
You can use a combination of the terms and range aggregations to achieve this.
Here is a working example with index data, search query and search result.
Index Data:
{
"site_id":1365,
"date":"2021-01-24"
}
{
"site_id":1300,
"date":"2021-01-22"
}
{
"site_id":1300,
"date":"2020-01-22"
}
{
"site_id":1364,
"date":"2021-01-24"
}
Search Query:
{
"size": 0,
"aggs": {
"siteId": {
"terms": {
"field": "site_id",
"include": [
1300,
1364
]
},
"aggs": {
"footfall": {
"range": {
"field": "date",
"ranges": [
{
"from": "2021-01-21",
"to": "2021-01-30"
}
]
}
}
}
}
}
}
Search Result:
"aggregations": {
"siteId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1300,
"doc_count": 2,
"footfall": {
"buckets": [
{
"key": "2021-01-21T00:00:00.000Z-2021-01-30T00:00:00.000Z",
"from": 1.6111872E12,
"from_as_string": "2021-01-21T00:00:00.000Z",
"to": 1.6119648E12,
"to_as_string": "2021-01-30T00:00:00.000Z",
"doc_count": 1 // note this
}
]
}
},
{
"key": 1364,
"doc_count": 1,
"footfall": {
"buckets": [
{
"key": "2021-01-21T00:00:00.000Z-2021-01-30T00:00:00.000Z",
"from": 1.6111872E12,
"from_as_string": "2021-01-21T00:00:00.000Z",
"to": 1.6119648E12,
"to_as_string": "2021-01-30T00:00:00.000Z",
"doc_count": 1 // note this
}
]
}
}
]
}
}
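To make the counts in the search result easier to check, here is a small Python sketch that mimics the terms + range combination on the four sample documents. It is only a client-side simulation under the assumption that a range bucket includes "from" and excludes "to"; the function name footfall_per_site is hypothetical:

```python
from collections import Counter
from datetime import date

# The four sample documents from "Index Data" above.
docs = [
    {"site_id": 1365, "date": date(2021, 1, 24)},
    {"site_id": 1300, "date": date(2021, 1, 22)},
    {"site_id": 1300, "date": date(2020, 1, 22)},
    {"site_id": 1364, "date": date(2021, 1, 24)},
]


def footfall_per_site(docs, include, start, end):
    """Count documents per site_id whose date falls in [start, end)."""
    counts = Counter()
    for doc in docs:
        # "terms" with "include" keeps only the listed site_ids.
        if doc["site_id"] not in include:
            continue
        # The range bucket includes "from" and excludes "to".
        if start <= doc["date"] < end:
            counts[doc["site_id"]] += 1
    return dict(counts)


print(footfall_per_site(docs, [1300, 1364], date(2021, 1, 21), date(2021, 1, 30)))
# {1300: 1, 1364: 1}
```

The 2020 document for site 1300 is counted in the outer terms bucket (doc_count 2) but not in the range bucket, which is why the inner doc_count is the number to read.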
This might perform better, since the documents are filtered in the query before the terms aggregation runs:
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"terms": {
"site_id": [
"1300",
"1365"
]
}
},
{
"range": {
"date": {
"gte": "2021-01-21",
"lte": "2021-01-24"
}
}
}
]
}
},
"aggs": {
"group_by": {
"terms": {
"field": "site_id"
}
}
}
}

Elasticsearch Date Histogram with a Point in Time count of documents

I am attempting to create a date histogram showing the number of employees on a monthly basis.
Employee mapping looks something like this:
{
"number": 1234,
"firstName": "Chris",
"lastName": "Smith",
"employmentDates": [
{
"startDate": "2014-10-03T06:00:00Z",
"endDate": "2017-11-04T06:00:00Z"
}
],
"lastPaidOnDate": "2017-11-10T06:00:00Z",
....
}
Given a start end scenario like this (for three employees):
|----------------|
|-----------------------------|
|---| |---------------------|
^ ^ ^ ^ ^ ^
I would expect the histogram to be similar to this:
"aggregations": {
"employees_per_month": {
"buckets": [
{
"key_as_string": "2017-01-01",
"doc_count": 1
},
{
"key_as_string": "2017-02-01",
"doc_count": 2
},
{
"key_as_string": "2017-03-01",
"doc_count": 2
},
{
"key_as_string": "2017-04-01",
"doc_count": 3
},
{
"key_as_string": "2017-05-01",
"doc_count": 3
},
{
"key_as_string": "2017-06-01",
"doc_count": 2
}
]
}
}
It seems like I need to have a sub-aggregation on a scripted field, but I'm not sure where to start.
Your assistance is greatly appreciated.
I believe it can be done with a DateHistogram, but I'm suggesting a simpler approach. You will have to run the query once for each month:
{
"size": 0,
"aggregations": {
"bool_agg": {
"filter": {
"bool": {
"must": [
{
"range": {
"employmentDates.startDate": {
"lt": "2017-12-01T00:00:00Z"
}
}
},
{
"range": {
"employmentDates.endDate": {
"gte": "2017-11-01T00:00:00Z"
}
}
}
]
}
},
"aggregations": {
"distinct_agg": {
"cardinality": {
"field": "number"
}
}
}
}
}
}
bool_agg: a Filter Aggregation that keeps only employment periods overlapping November 2017.
distinct_agg: a Cardinality Aggregation that counts the employees by the unique field number.
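Since the query has to be run once per month, the request bodies can be generated rather than written by hand. A rough Python sketch, assuming the same field names as above; the helper names month_bounds and employees_in_month_query are hypothetical, and sending the requests is left to whatever client you use:

```python
from datetime import datetime


def month_bounds(year, month):
    """Return (first instant of the month, first instant of the next month)."""
    start = datetime(year, month, 1)
    # December rolls over into January of the following year.
    if month == 12:
        end = datetime(year + 1, 1, 1)
    else:
        end = datetime(year, month + 1, 1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


def employees_in_month_query(year, month):
    """Build the filter + cardinality request body for one month."""
    start, end = month_bounds(year, month)
    return {
        "size": 0,
        "aggregations": {
            "bool_agg": {
                "filter": {
                    "bool": {
                        "must": [
                            # Employment started before the month ended ...
                            {"range": {"employmentDates.startDate": {"lt": end}}},
                            # ... and ended on or after the month started.
                            {"range": {"employmentDates.endDate": {"gte": start}}},
                        ]
                    }
                },
                "aggregations": {
                    "distinct_agg": {"cardinality": {"field": "number"}}
                },
            }
        },
    }


# One request body per month of 2017.
bodies = {month: employees_in_month_query(2017, month) for month in range(1, 13)}
```

For November 2017 this produces the same two bounds as the query above ("lt" 2017-12-01 and "gte" 2017-11-01).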
Note that if employmentDates contains more than one record, e.g.:
"employmentDates": [
{
"startDate": "2014-10-03T06:00:00Z",
"endDate": "2017-11-04T06:00:00Z"
},
{
"startDate": "2018-03-03T06:00:00Z",
"endDate": "2018-07-04T06:00:00Z"
}
]
you must map it as a nested field using the Nested Datatype; an example can be found here.
And update the query to:
{
"size": 0,
"aggregations": {
"nested_agg": {
"nested": {
"path": "employmentDates"
},
"aggregations": {
"bool_agg": {
"filter": {
"bool": {
"must": [
{
"range": {
"employmentDates.startDate": {
"lt": "2017-12-01T00:00:00Z"
}
}
},
{
"range": {
"employmentDates.endDate": {
"gte": "2017-11-01T00:00:00Z"
}
}
}
]
}
},
"aggregations": {
"comment_to_issue": {
"reverse_nested": {},
"aggregations": {
"distinct_agg": {
"cardinality": {
"field": "number"
}
}
}
}
}
}
}
}
}
}
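Why nested matters here can be shown with a small simulation. The Python sketch below compares matching against a flattened object array, where each range clause may be satisfied by a different employment period, with nested matching, where both clauses must hold for the same period. The employee and its dates are made up for illustration, and the function names are hypothetical:

```python
from datetime import date

# Hypothetical employee with two employment periods and a gap that covers
# November 2017 (these dates are made up for illustration).
periods = [
    {"startDate": date(2014, 10, 3), "endDate": date(2017, 5, 4)},
    {"startDate": date(2018, 3, 3), "endDate": date(2018, 7, 4)},
]

month_start = date(2017, 11, 1)
next_month_start = date(2017, 12, 1)


def matches_flattened(periods):
    """Object arrays: each range clause may be satisfied by a different entry."""
    started_before_end = any(p["startDate"] < next_month_start for p in periods)
    ended_after_start = any(p["endDate"] >= month_start for p in periods)
    return started_before_end and ended_after_start


def matches_nested(periods):
    """Nested objects: both range clauses must hold for the same entry."""
    return any(
        p["startDate"] < next_month_start and p["endDate"] >= month_start
        for p in periods
    )


print(matches_flattened(periods))  # True  (false positive)
print(matches_nested(periods))     # False (not employed in November 2017)
```

Without the nested mapping, this employee would be counted in November 2017 even though neither period covers that month.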

Can ElasticSearch aggregate over top N items in each sorted bucket

I have this query that buckets the records by data source code, and computes an average over all records in each bucket.
How could I modify it so that each bucket is limited to (at most) the top N records when ordered by record.timestamp desc (or any other record field, for that matter)?
The end effect I want is an average per bucket using the most recent N records rather than all records (so the doc_count in each bucket would have an upper limit of N).
I've searched and experimented extensively with no success.
Current query:
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"jobType": "LiveEventScoring"
}
},
{
"term": {
"host": "MTVMDANS"
}
},
{
"term": {
"measurement": "EventDataLoadFromCacheDuration"
}
}
]
}
}
}
},
"aggs": {
"data-sources": {
"terms": {
"field": "dataSourceCode"
},
"aggs": {
"avgDuration": {
"avg": {
"field": "elapsedMs"
}
}
}
}
}
}
Results in:
"aggregations": {
"data-sources": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "AU_VIRT",
"doc_count": 6259,
"avgDuration": {
"value": 3525.683176226234
}
},
{
"key": "AU_HN_VIRT",
"doc_count": 2812,
"avgDuration": {
"value": 3032.0771692745375
}
},
{
"key": "GB_VIRT",
"doc_count": 1845,
"avgDuration": {
"value": 1432.39945799458
}
}
]
}
}
}
Alternatively, if grabbing the top N from a sorted bucket is not possible, I could run multiple queries, one for each dataSourceCode, e.g. for AU_VIRT:
{
"size":0,
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"jobType": "LiveEventScoring"
}
},
{
"term": {
"host": "MTVMDANS"
}
},
{
"term": {
"dataSourceCode": "AU_VIRT"
}
},
{
"term": {
"measurement": "EventDataLoadFromCacheDuration"
}
}
]
}
}
}
},
"aggs": {
"avgDuration": {
"avg": {
"field": "elapsedMs"
}
}
}
}
but I am now challenged in how I make the avgDuration work on top N results sorted by timestamp desc.
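To pin down the result being asked for, here is the same computation written as client-side Python over already fetched records. This is only a sketch of the desired semantics, not an Elasticsearch query; it assumes each record carries dataSourceCode, timestamp and elapsedMs, and the function name avg_of_latest and the sample records are made up:

```python
from collections import defaultdict


def avg_of_latest(records, n):
    """Average elapsedMs over the n most recent records per dataSourceCode."""
    by_source = defaultdict(list)
    for record in records:
        by_source[record["dataSourceCode"]].append(record)

    averages = {}
    for source, group in by_source.items():
        # Newest first, then keep at most n records.
        latest = sorted(group, key=lambda r: r["timestamp"], reverse=True)[:n]
        averages[source] = sum(r["elapsedMs"] for r in latest) / len(latest)
    return averages


# Made-up records; timestamps are plain integers for brevity.
records = [
    {"dataSourceCode": "AU_VIRT", "timestamp": 1, "elapsedMs": 9000},
    {"dataSourceCode": "AU_VIRT", "timestamp": 2, "elapsedMs": 3000},
    {"dataSourceCode": "AU_VIRT", "timestamp": 3, "elapsedMs": 5000},
    {"dataSourceCode": "GB_VIRT", "timestamp": 1, "elapsedMs": 1000},
]

print(avg_of_latest(records, 2))  # {'AU_VIRT': 4000.0, 'GB_VIRT': 1000.0}
```

With N = 2, the oldest AU_VIRT record (9000 ms) is left out of the average, which is the effect described above.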

Elasticsearch: is it possible to run a not-analysed aggregation query on an analysed field?

I have documents that store brand names in analysed form, for example {"name":"Sam-sung"} and {"name":"Motion:Systems"}. In some cases I want to aggregate these brands within a timestamp range.
My query is as follows:
{
"size": 0,
"aggs": {
"filtered_aggs": {
"filter": {
"range": {
"@timestamp":{
"gte":"2016-07-18T14:23:41.459Z",
"lte":"2016-07-18T14:53:10.017Z"
}
}
},
"aggs": {
"execute_time": {
"terms": {
"field": "brands",
"size": 0
}
}
}
}
}
}
but the returned results are:
{
...
"aggregations": {
"states": {
"buckets": [
{
"key": "Sam",
"doc_count": 5
},
{
"key": "sung",
"doc_count": 5
},
{
"key": "Motion",
"doc_count": 1
},
{
"key": "Systems",
"doc_count": 1
}
]
}
}
}
but the results I want are:
{
...
"aggregations": {
"states": {
"buckets": [
{
"key": "Sam-sung",
"doc_count": 5
},
{
"key": "Motion:Systems",
"doc_count": 1
}
]
}
}
}
Is there any way to run a not-analysed query on an analysed field in Elasticsearch?
You need to add a not_analyzed sub-field to your brands field and then aggregate on that sub-field.
PUT /index/_mapping/type
{
"properties": {
"brands": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then you need to fully reindex your data in order to populate the new sub-field brands.raw.
Finally, you can change your query to this:
POST index/_search
{
"size": 0,
"aggs": {
"filtered_aggs": {
"filter": {
"range": {
"@timestamp":{
"gte":"2016-07-18T14:23:41.459Z",
"lte":"2016-07-18T14:53:10.017Z"
}
}
},
"aggs": {
"execute_time": {
"terms": {
"field": "brands.raw",
"size": 0
}
}
}
}
}
}
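The difference between the two fields can be illustrated outside Elasticsearch. In the Python sketch below, simple_analyze is a hypothetical stand-in for an analyzer (it just splits on non-alphanumeric characters and is not the analyzer Elasticsearch actually uses), and the brand list is assumed from the counts in the question:

```python
import re
from collections import Counter

# Five "Sam-sung" documents and one "Motion:Systems" document, as in the question.
brands = ["Sam-sung"] * 5 + ["Motion:Systems"]


def simple_analyze(text):
    """Stand-in for an analyzer: split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


# Aggregating on the analyzed field counts each token separately.
analyzed_counts = Counter(token for b in brands for token in simple_analyze(b))

# Aggregating on a not_analyzed sub-field counts the whole value.
raw_counts = Counter(brands)

print(dict(analyzed_counts))  # {'Sam': 5, 'sung': 5, 'Motion': 1, 'Systems': 1}
print(dict(raw_counts))       # {'Sam-sung': 5, 'Motion:Systems': 1}
```

The first output corresponds to the buckets the question complains about, the second to the buckets returned when aggregating on brands.raw.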

elasticsearch: using nested agg after reverse_nested shows higher count than expected

Using Elasticsearch 2.2.0, I am doing this:
Grouping by a nested field: nested_path.nested_field
Using a reverse_nested agg so I can apply this filter: non_nested_field == "yay"
Using a nested agg so I can then get a count of the nested field I am grouping by: nested_path.nested_field
Problem: By using the reverse_nested agg I am getting a higher doc_count than I would expect.
Here is the mapping and docs I am indexing:
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"nested_path": {
"type": "nested",
"properties": {
"nested_field": {
"type": "string"
}
}
},
"non_nested_field": {
"type": "string"
}
}
}
}
}
POST /my_index/my_type/1
{
"non_nested_field": "whoray",
"nested_path": [
{
"nested_field": "yes"
},
{
"nested_field": "yes"
},
{
"nested_field": "no"
}
]
}
POST /my_index/my_type/2
{
"non_nested_field": "yay",
"nested_path": [
{
"nested_field": "maybe"
},
{
"nested_field": "no"
}
]
}
Request body:
POST my_index/my_type/_search
{
"aggs": {
"nested_option": {
"nested": {
"path": "nested_path"
},
"aggs": {
"group_list": {
"terms": {
"field": "nested_path.nested_field",
"size": 100
},
"aggs": {
"level_1": {
"reverse_nested": {},
"aggs": {
"level_2": {
"filter": {
"term": {
"non_nested_field": "yay"
}
},
"aggs": {
"level_3": {
"nested": {
"path": "nested_path"
},
"aggs": {
"stat": {
"value_count": {
"field": "nested_path.nested_field"
}
}
}
}
}
}
}
}
}
}
}
}
},
"size": 0
}
Part of the response I get is this:
{
"aggregations": {
"nested_option": {
"doc_count": 5,
"group_list": {
"buckets": [
{
"key": "no",
"doc_count": 2,
"level_1": {
"doc_count": 2,
"level_2": {
"doc_count": 1,
"level_3": {
"doc_count": 2,
"stat": {
"value": 2
}
}
}
}
}
//....
]
}
}
}
}
In the first element of the buckets array in the response, level_1.level_2.doc_count is 1, and this is correct, because there's only one of the two docs indexed where nested_path.nested_field == "no" and non_nested_field == "yay". But level_1.level_2.level_3.doc_count in the response is 2. It should only be 1. This seems like a bug to me.
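One possible reading of these numbers, offered as an assumption rather than a confirmed diagnosis: after reverse_nested the aggregation works on parent documents, so the nested agg at level_3 steps into every nested document of the remaining parent, not only the ones that matched the bucket key. The Python sketch below simulates that reading on the two indexed documents and reproduces the counts in the response:

```python
# The two documents indexed above, with nested_path reduced to a list of values.
docs = [
    {"id": 1, "non_nested_field": "whoray", "nested": ["yes", "yes", "no"]},
    {"id": 2, "non_nested_field": "yay", "nested": ["maybe", "no"]},
]

key = "no"

# group_list bucket: nested documents whose nested_field equals the key.
bucket = [(d["id"], v) for d in docs for v in d["nested"] if v == key]

# level_1 (reverse_nested): the distinct parents of those nested documents.
parent_ids = {doc_id for doc_id, _ in bucket}
level_1 = [d for d in docs if d["id"] in parent_ids]

# level_2 (filter): parents whose non_nested_field is "yay".
level_2 = [d for d in level_1 if d["non_nested_field"] == "yay"]

# level_3 (nested): every nested document of the remaining parents,
# no longer restricted to the bucket key.
level_3 = [v for d in level_2 for v in d["nested"]]

print(len(bucket), len(level_1), len(level_2), len(level_3))  # 2 2 1 2
```

Under this reading, the 2 at level_3 comes from document 2 having two nested entries ("maybe" and "no"), so restricting the count to "no" would need an extra filter inside level_3.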
