How to count number of fields inside nested field? - Elasticsearch - elasticsearch

I did the following mapping. I would like to count the number of products in each nested field "products" (for each document separately). I would also like to do a histogram aggregation, so that I would know the number of specific bucket sizes.
PUT /receipts
{
"mappings": {
"properties": {
"id" : {
"type": "integer"
},
"user_id" : {
"type": "integer"
},
"date" : {
"type": "date"
},
"sum" : {
"type": "double"
},
"products" : {
"type": "nested",
"properties": {
"name" : {
"type" : "text"
},
"number" : {
"type" : "double"
},
"price_single" : {
"type" : "double"
},
"price_total" : {
"type" : "double"
}
}
}
}
}
}
I've tried this query, but I get the number of all the products instead of number of products for each document separately.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products"
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count" : 6552,
"bucket_size" : {
"value" : 0
}
}
}
UPDATE
Now I have this code where I make separate buckets for each id and count the number of products inside them.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size" : 0,
"aggs": {
"terms":{
"terms":{
"field": "_id"
},
"aggs": {
"nested": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count_error_upper_bound" : 5,
"sum_other_doc_count" : 490,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"nested" : {
"doc_count" : 21,
"bucket_size" : {
"value" : 21
}
}
},
{
"key" : "10",
"doc_count" : 1,
"nested" : {
"doc_count" : 5,
"bucket_size" : {
"value" : 5
}
}
},
{
"key" : "100",
"doc_count" : 1,
"nested" : {
"doc_count" : 12,
"bucket_size" : {
"value" : 12
}
}
},
...
Is is possible to group these values (21, 5, 12, ...) into buckets to make a histogram of them?

products is only the path to the array of individual products, not an aggregatable field. So you'll need to use it on one of your product's field -- such as the number:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
Note that is a product has no number, it'll not contribute to the total count. It's therefore best practice to always include an ID in each of them and then aggregate on that field.
Alternatively you could use a script to account for missing values. Luckily value_count does not deduplicate -- meaning if two products are alike and/or have empty values, they'll still be counted as two:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"script": {
"source": "doc['products.number'].toString()"
}
}
}
}
}
}
}
UPDATE
You could also use a nested composite aggregation which'll give you the histogrammed product count w/ the corresponding receipt id:
GET /receipts/_search
{
"size": 0,
"aggs": {
"my_aggs": {
"nested": {
"path": "products"
},
"aggs": {
"composite_parent": {
"composite": {
"sources": [
{
"receipt_id": {
"terms": {
"field": "_id"
}
}
},
{
"product_number": {
"histogram": {
"field": "products.number",
"interval": 1
}
}
}
]
}
}
}
}
}
}
The interval is modifiable.

Related

Sub-aggregate a multi-level nested composite aggregation

I'm trying to set up a search query that should composite aggregate a collection by a multi-level nested field and give me some sub-aggregation metrics from this collection. I was able to fetch the composite aggregation with its buckets as expected but the sub-aggregation metrics come with 0 for all buckets. I'm not sure if I am failing to correctly point out what fields the sub-aggregation should consider or if it should be placed inside a different part of the query.
My collection looks similar to the following:
{
id: '32ead132eq13w21',
statistics: {
clicks: 123,
views: 456
},
categories: [{ //nested type
name: 'color',
tags: [{ //nested type
slug: 'blue'
},{
slug: 'red'
}]
}]
}
Bellow you can find what I have tried so far. All buckets come with clicks sum as 0 even though all documents have a set clicks value.
GET /acounts-123321/_search
{
"size": 0,
"aggs": {
"nested_categories": {
"nested": {
"path": "categories"
},
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"group": {
"composite": {
"size": 100,
"sources": [
{ "slug": { "terms" : { "field": "categories.tags.slug"} }}
]
},
"aggregations": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
The response body I have so far:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1304,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"nested_categories" : {
"doc_count" : 1486,
"nested_tags" : {
"doc_count" : 1486,
"group" : {
"buckets" : [
{
"key" : {
"slug" : "red"
},
"doc_count" : 268,
"clicks" : {
"value" : 0.0
}
}, {
"key" : {
"slug" : "blue"
},
"doc_count" : 122,
"clicks" : {
"value" : 0.0
},
.....
]
}
}
}
}
}
In order for this to work, all sources in the composite aggregation would need to be under the same nested context.
I've answered something similar a while ago. The asker needed to put the nested values onto the top level. You have the opposite challenge -- given that the stats.clicks field is on the top level, you'd need to duplicate it across each entry of the categories.tags which, I suspect, won't be feasible because you're likely updating these stats every now and then…
If you're OK with skipping the composite approach and using the terms agg without it, you could make the summation work by jumping back to the top level thru reverse_nested:
{
"size": 0,
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"by_slug": {
"terms": {
"field": "categories.tags.slug",
"size": 100
},
"aggs": {
"back_to_parent": {
"reverse_nested": {},
"aggs": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
This'll work just as fine but won't offer pagination.
Clarification
If you needed a color filter, you could do:
{
"size": 0,
"aggs": {
"categories_parent": {
"nested": {
"path": "categories"
},
"aggs": {
"filtered_by_color": {
"filter": {
"term": {
"categories.name": "color"
}
},
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"by_slug": {
"terms": {
"field": "categories.tags.slug",
"size": 100
},
"aggs": {
"back_to_parent": {
"reverse_nested": {},
"aggs": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
}
}
}
}

I need to get average document count by date in elasticsearch

I want to get average document count by date without getting the whole bunch of buckets data and get average value by hand cause there are years of data and when I group by the date I get too_many_buckets_exception.
So my current query is
{
"query": {
"bool": {
"must": [],
"filter": []
}
},
"aggs": {
"groupByChannle": {
"terms": {
"field": "channel"
},
"aggs": {
"docs_per_day": {
"date_histogram": {
"field": "message_date",
"fixed_interval": "1d"
}
}
}
}
}
}
How can I get an average doc count grouped by message_date(day) and channel without taking buckets array of this data
"buckets" : [
{
"key_as_string" : "2018-03-17 00:00:00",
"key" : 1521244800000,
"doc_count" : 4027
},
{
"key_as_string" : "2018-03-18 00:00:00",
"key" : 1521331200000,
"doc_count" : 10133
},
...thousands of rows
]
my index structure looks like this
"mappings" : {
"properties" : {
"channel" : {
"type" : "keyword"
},
"message" : {
"type" : "text"
},
"message_date" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss"
},
}
}
By this query, I want to get JUST A AVERAGE DOC COUNT BY DATE and nothing else
"avg_count": {
"avg_bucket": {
"buckets_path": "docs_per_day>_count"
}
}
after docs_per_day ending this.
avg_count provides average count.
_count refers the bucket count
I think, that you can use stats aggregation with the script :
{
"size": 0,
"aggs": {
"term": {
"terms": {
"field": "chanel"
},
"aggs": {
"stats": {
"stats": {
"field": "message_date"
}
},
"result": {
"bucket_script": {
"buckets_path": {
"max" : "stats.max",
"min" : "stats.min",
"count" : "stats.count"
},
"script": "params.count/(params.max - params.min)/1000/86400)"
}
}
}
}
}
}

How to aggregate nested fields to include null values?

I'm having trouble aggregating my nested data to include null values as well.
I'm using Elasticsearch version 6.8
I'll simplify the problem, I've a nested field that looks like:
PUT test/doc/_mapping
{
"properties": {
"fields": {
"type" : "nested",
"properties" : {
"name" : {
"type" : "keyword"
},
"value" : {
"type" : "long"
}
}
}
}
}
I created 3 documents:
PUT test/doc/1
{
"fields" : {
"name" : "aaa",
"value" : 1
}
}
PUT test/doc/2
{
"fields" : [{
"name" : "aaa",
"value" : 1
},
{
"name" : "bbb",
"value" : 2
}]
}
PUT test/doc/3
{
"fields" : [
{
"name" : "bbb",
"value" : 2
}]
}
Now I want to group my data to get how many documents there are where name="bbb" group by each value.
For the above data I want to get:
2 – 2 documents
N/A – 1 document (the first document where bbb is missing)
The problem is with the null values, I cannot find a way to match the documents where "bbb" is null and put them in a N/A bucket.
So far I wrote a query that match the values where "bbb" exist:
GET test/doc/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"my_agg": {
"nested": {
"path": "fields"
},
"aggs": {
"my_filter": {
"filter": {
"term": {
"fields.name": "bbb"
}
},
"aggs": {
"my_term": {
"terms": {
"field": "fields.value"
}
}
}
}
}
}
}
}
And the response is:
"aggregations" : {
"my_agg" : {
"doc_count" : 4,
"my_filter" : {
"doc_count" : 2,
"my_term" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2,
"doc_count" : 2
}
]
}
}
}
}
I want to get also:
"key" : 0 (for N/A)
"doc_count" : 1
What am I missing?
If I understand this correctly, you want to know the buckets where there was zero/null/no matches. You can use min_doc_count
GET test/doc/_search
{
"size": ,
"query": {
"match_all": {}
},
"aggs": {
"my_agg": {
"nested": {
"path": "fields"
},
"aggs": {
"my_filter": {
"filter": {
"term": {
"fields.name": "bbb"
}
},
"aggs": {
"my_term": {
"terms": {
"field": "fields.value", --> you can also use "_id" to get count based on each document
"min_doc_count": 0 --> this will include all the buckets where count is zero/ or there is no match.
}
}
}
}
}
}
}
}
You could also use inner_hits to find a hit in each document or use _id in above aggregations query.
POST test/_search
{
"query": {
"bool": {
"should": [
{
"match_all": {}
},
{
"nested": {
"path": "fields",
"query": {
"match": {
"fields.name": "bbb"
}
},
"inner_hits": {}
}
}
]
}
}
}

Elastic Search: Selecting multiple vlaues in aggregates

In Elastic Search I have the following index with 'allocated_bytes', 'total_bytes' and other fields:
{
"_index" : "metrics-blockstore_capacity-2017_06",
"_type" : "datapoint",
"_id" : "AVzHwgsi9KuwEU6jCXy5",
"_score" : 1.0,
"_source" : {
"timestamp" : 1498000001000,
"resource_guid" : "2185d15c-5298-44ac-8646-37575490125d",
"allocated_bytes" : 1.159196672E9,
"resource_type" : "machine",
"total_bytes" : 1.460811776E11,
"machine" : "2185d15c-5298-44ac-8646-37575490125d"
}
I have the following query to
1)get a point for 30 minute interval using date-histogram
2)group by field on resource_guid.
3)max aggregate to find the max value.
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1497992400000,
"lte": 1497996000000
}
}
}
]
}
},
"aggregations": {
"groupByTime": {
"date_histogram": {
"field": "timestamp",
"interval": "30m",
"order": {
"_key": "desc"
}
},
"aggregations": {
"groupByField": {
"terms": {
"size": 1000,
"field": "resource_guid"
},
"aggregations": {
"maxValue": {
"max": {
"field": "allocated_bytes"
}
}
}
},
"sumUnique": {
"sum_bucket": {
"buckets_path": "groupByField>maxValue"
}
}
}
}
}
}
But with this query I am able to get only allocated_bytes, but I need to have both allocated_bytes and total_bytes at the result point.
Following is the result from the above query:
{
"key_as_string" : "2017-06-20T21:00:00.000Z",
"key" : 1497992400000,
"doc_count" : 9,
"groupByField" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "2185d15c-5298-44ac-8646-37575490125d",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156182016E9
}
}, {
"key" : "c3513cdd-58bb-4f8e-9b4c-467230b4f6e2",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156165632E9
}
}, {
"key" : "eff13403-9737-4d08-9dca-fb6c12c3a6fa",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156182016E9
}
} ]
},
"sumUnique" : {
"value" : 3.468529664E9
}
}
I do need both allocated_bytes and total_bytes. How do I get multiple fields( allocated_bytes, total_bytes) for each point?
For example:
"sumUnique" : {
"Allocatedvalue" : 3.468529664E9,
"TotalValue" : 9.468529664E9
}
or like this:
"allocatedBytessumUnique" : {
"value" : 3.468529664E9
}
"totalBytessumUnique" : {
"value" : 9.468529664E9
},
You can just add another aggregation:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1497992400000,
"lte": 1497996000000
}
}
}
]
}
},
"aggregations": {
"groupByTime": {
"date_histogram": {
"field": "timestamp",
"interval": "30m",
"order": {
"_key": "desc"
}
},
"aggregations": {
"groupByField": {
"terms": {
"size": 1000,
"field": "resource_guid"
},
"aggregations": {
"maxValueAllocated": {
"max": {
"field": "allocated_bytes"
}
},
"maxValueTotal": {
"max": {
"field": "total_bytes"
}
}
}
},
"sumUniqueAllocatedBytes": {
"sum_bucket": {
"buckets_path": "groupByField>maxValueAllocated"
}
},
"sumUniqueTotalBytes": {
"sum_bucket": {
"buckets_path": "groupByField>maxValueTotal"
}
}
}
}
}
}
I hope you are aware that sum_bucket calculates sibling aggregations only, in this case gives sum of max values, not the sum of total_bytes. If you want to get sum of total_bytes you can use sum aggregation

Elasticsearch aggregation by arrays of String

I have an ElasticSearch index, where I store telephony transactions (SMS, MMS, Calls, etc ) with their associated costs.
The key of these documents are the MSISDN (MSISDN = phone number). In my app, I know that there are group of users. Each users can have one or more MSISDN.
Here is the mapping of this kind of documents :
"mappings" : {
"cdr" : {
"properties" : {
"callDatetime" : {
"type" : "long"
},
"callSource" : {
"type" : "string"
},
"callType" : {
"type" : "string"
},
"callZone" : {
"type" : "string"
},
"calledNumber" : {
"type" : "string"
},
"companyKey" : {
"type" : "string"
},
"consumption" : {
"properties" : {
"data" : {
"type" : "long"
},
"voice" : {
"type" : "long"
}
}
},
"cost" : {
"type" : "double"
},
"country" : {
"type" : "string"
},
"included" : {
"type" : "boolean"
},
"msisdn" : {
"type" : "string"
},
"network" : {
"type" : "string"
}
}
}
}
My goal and issue :
My goal is to make a query that retrieve cost by callType by group. But groups are not represented in ElasticSearch, only in my PostgreSQL database.
So I will make a method that retrieves all the MSISDN for every existing group, and get something like a List of String arrays, containing every MSISDN within each group.
Let's say I have something like :
"msisdn_by_group" : [
{
"group1" : ["01111111111", "02222222222", "033333333333", "044444444444"]
},
{
"group2" : ["05555555555","06666666666"]
}
]
Now, I will use this to generate an Elasticsearch query. I want to make with an aggregation, the sum of the cost, for all those terms in different buckets, and then split it again by callType. (to make a stackedbar chart).
I've tried several things, but didn't manage to make it work (histogram, buckets, term and sum was mainly the keyword i'm playing with).
If somebody here can help me with the order, and the keywords I can use to achieve this, it would be great :) Thanks
EDIT :
Here is my last try :
QUERY:
{
"aggs" : {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
}
I go the expected result, but it missing the "group" split, as I don't know how to pass the MSISDN arrays as a criteria :
RESULT :
"aggregations": {
"cost_histogram": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "data",
"doc_count": 5925,
"cost_histogram_sum": {
"value": 0
}
},
{
"key": "sms_mms",
"doc_count": 5804,
"cost_histogram_sum": {
"value": 91.76999999999995
}
},
{
"key": "voice",
"doc_count": 5299,
"cost_histogram_sum": {
"value": 194.1196
}
},
{
"key": "sms_mms_plus",
"doc_count": 35,
"cost_histogram_sum": {
"value": 7.2976
}
}
]
}
}
Ok I found out how to make this with one query, but it's damn a long query because it repeats for every group, but I have no choise. I'm using the "filter" aggregator.
Here is a working example based on the array I wrote in my question above :
POST localhost:9200/cdr/_search?size=0
{
"query": {
"term" : {
"companyKey" : 1
}
},
"aggs" : {
"group_1_split_cost": {
"filter": {
"bool": {
"should": [{
"bool": {
"must": {
"match": {
"msisdn": "01111111111"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "02222222222"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "03333333333"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "04444444444"
}
}
}
}]
}
},
"aggs": {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
},
"group_2_split_cost": {
"filter": {
"bool": {
"should": [{
"bool": {
"must": {
"match": {
"msisdn": "05555555555"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "06666666666"
}
}
}
}]
}
},
"aggs": {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
}
}
}
Thanks to the newer versions of Elasticsearch we can now nest very deep aggregations, but it's still a bit too bad that we can't pass arrays of values to an "OR" operator or something like that. It could reduce the size of those queries, I guess. Even if they are a bit special and used in niche cases, as mine.

Resources