I would like to put a condition on my query, in other words filter the data based on aggregated data.
Currently, I have this query:
GET sense/_search
{
"size": 0,
"aggs": {
"dates": {
"date_histogram": {
"field": "#timestamp",
"interval": "1d",
"format": "yyyy-MM-dd",
"offset": "+4h"
},
"aggs": {
"unique_sessions": {
"terms": {
"field": "sessionId"
}
}
}
}
}
}
which returns this kind of data
{
"aggregations" : {
"dates" : {
"buckets" : [
{
"key_as_string" : "2019-03-31",
"key" : 1554004800000,
"doc_count" : 14,
"unique_sessions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "83e1c3a4-341c-4ac3-a81e-f00336ee1dfb",
"doc_count" : 3
},
{
"key" : "99c4d312-2477-4bf7-ad02-ef76f50443f9",
"doc_count" : 3
},
{
"key" : "425b840f-9604-4f1d-ab18-96a9a7ae44e0",
"doc_count" : 1
},
{
"key" : "580b1f6c-6256-4f38-9803-2cc79a0a63d7",
"doc_count" : 2
},
{
"key" : "8929d75d-153c-4b66-8dd7-2eacb7974b95",
"doc_count" : 1
},
{
"key" : "8da5d732-d1e7-4a63-8f02-2b84a8bdcb62",
"doc_count" : 2
}
]
}
},
{
"key_as_string" : "2019-04-01",
"key" : 1554091200000,
"doc_count" : 1,
"unique_sessions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "513d4532-304d-44c7-bdc7-398795800383",
"doc_count" : 1
},
{
"key" : "8da5d732-d1e7-4a63-8f02-2791poc34gq1",
"doc_count" : 2
}
]
}
}
]
}
}
}
So I would like to retrieve, per date, the count of unique sessionId values whose doc_count equals 1.
That means I expect the date histogram bucket with key "2019-03-31"
to show 2 (because the unique_sessions aggregation for that date contains only two sessions with doc_count equal to one), and accordingly "2019-04-01" to show 1.
I have no clue how to realize this aggregation.
You would need to make use of a Bucket Selector aggregation on the terms aggregation that you have.
Below is how your query would appear:
Sample Query
POST <your_index_name>/_search
{
"size":0,
"aggs":{
"dates":{
"date_histogram":{
"field":"#timestamp",
"interval":"1d",
"format":"yyyy-MM-dd",
"offset":"+4h"
},
"aggs":{
"unique_sessions":{
"terms":{
"field":"sessionId"
},
"aggs":{
"unique_buckets":{
"bucket_selector":{
"buckets_path":{
"count":"_count"
},
"script":"params.count==1"
}
}
}
}
}
}
}
}
Note that you'd end up with empty date buckets in that case, as you can see in the sample response below.
Sample Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"dates": {
"buckets": [
{
"key_as_string": "2018-12-31",
"key": 1546228800000,
"doc_count": 3,
"unique_sessions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "83e1c3a4-3AFA1c-4ac3-a81e-f00336ee1dfb",
"doc_count": 1
}
]
}
},
{
"key_as_string": "2019-01-01",
"key": 1546315200000,
"doc_count": 0,
"unique_sessions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
},
{
"key_as_string": "2019-01-02",
"key": 1546401600000,
"doc_count": 3,
"unique_sessions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
},
{
"key_as_string": "2019-01-03",
"key": 1546488000000,
"doc_count": 3,
"unique_sessions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "83e1c3a4-3AFA1c-4ab3-a81e-f00336ee1dfb",
"doc_count": 1
}
]
}
}
]
}
}
}
If you want to filter those out, so that only the parent (date) buckets which still contain child buckets matching count==1 are returned, just make use of the below query, where I've added another bucket_selector clause.
Note carefully the structure of the query.
Refined Query Solution:
POST <your_index_name>/_search
{
"size":0,
"aggs":{
"dates":{
"date_histogram":{
"field":"#timestamp",
"interval":"1d",
"format":"yyyy-MM-dd",
"offset":"+4h"
},
"aggs":{
"unique_sessions":{
"terms":{
"field":"sessionId"
},
"aggs":{
"unique_buckets":{
"bucket_selector":{
"buckets_path":{
"count":"_count"
},
"script":"params.count==1"
}
}
}
},
"terms_bucket_clause": {
"bucket_selector": {
"buckets_path": {
"count": "unique_sessions._bucket_count"
},
"script": "params.count>0"
}
}
}
}
}
}
Refined Query Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"dates": {
"buckets": [
{
"key_as_string": "2018-12-31",
"key": 1546228800000,
"doc_count": 3,
"unique_sessions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "83e1c3a4-3AFA1c-4ac3-a81e-f00336ee1dfb",
"doc_count": 1
}
]
}
},
{
"key_as_string": "2019-01-03",
"key": 1546488000000,
"doc_count": 3,
"unique_sessions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "83e1c3a4-3AFA1c-4ab3-a81e-f00336ee1dfb",
"doc_count": 1
}
]
}
}
]
}
}
}
Do note the difference in the results of the two queries.
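If what you ultimately need is the number of single-occurrence sessions per date (2 for "2019-03-31" and 1 for "2019-04-01" in your sample data) rather than the pruned session buckets themselves, you could additionally expose that number with a bucket_script at the date level. The sketch below builds on the refined query and assumes bucket_script resolves the _bucket_count path the same way the bucket_selector in terms_bucket_clause does, so treat it as a starting point rather than a tested solution:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "dates": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "1d",
        "format": "yyyy-MM-dd",
        "offset": "+4h"
      },
      "aggs": {
        "unique_sessions": {
          "terms": {
            "field": "sessionId"
          },
          "aggs": {
            "unique_buckets": {
              "bucket_selector": {
                "buckets_path": {
                  "count": "_count"
                },
                "script": "params.count==1"
              }
            }
          }
        },
        "single_session_count": {
          "bucket_script": {
            "buckets_path": {
              "count": "unique_sessions._bucket_count"
            },
            "script": "params.count"
          }
        }
      }
    }
  }
}
Each date bucket should then carry a single_session_count value equal to the number of sessions that occurred exactly once on that day. Hope this helps!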
Related
I have documents with the following structure (very much simplified for the example):
"documents": [
{
"name": "Document 1",
"collections" : [
{
"id": 30,
"title" : "Research"
},
{
"id": 45,
"title" : "Events"
},
{
"id" : 52,
"title" : "International"
}
]
},
{
"name": "Document 2",
"collections" : [
{
"id": 45,
"title" : "Events"
},
{
"id" : 63,
"title" : "Development"
}
]
}
]
I want an aggregation of the collections. It works fine when I do it like this:
"aggs": {
"collections": {
"terms": {
"field": "collections.title",
"size": 30
}
}
}
I get a nice result as expected:
"aggregations" : {
"collections" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Research",
"doc_count" : 18
},
{
"key" : "Events",
"doc_count" : 14
},
{
"key" : "International",
"doc_count" : 13
},
{
"key" : "Development",
"doc_count" : 8
}
]
}
}
However, I want the id included as well. So I tried this:
"aggs": {
"collections": {
"terms": {
"field": "collections.title",
"size": 30
}
},
"aggs": {
"id": {
"terms": {
"field": "collections.id",
"size": 1
}
}
}
}
This is the result:
"aggregations" : {
"collections" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Research",
"doc_count" : 18,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "30",
"doc_count" : 1
}
]
}
},
{
"key" : "Events",
"doc_count" : 14,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "45",
"doc_count" : 1
}
]
}
},
{
"key" : "International",
"doc_count" : 13,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "52",
"doc_count" : 1
}
]
}
},
{
"key" : "Development",
"doc_count" : 8,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "45",
"doc_count" : 1
}
]
}
}
]
}
}
At a glance it looks good. But take a closer look at the last element, Development: the id should be 63, but it is 45.
I have a vague idea why this is, but I cannot find a solution for it. I also tried multi_terms, but it gives a similar result. I think the issue has to do with the fact that there are multiple collections within the document.
Does anyone know the correct way to solve this issue?
The reason is that in an object type mapping there is no relation between "title" and "id"; everything is flattened by Elasticsearch under the hood, so:
"collections" : [
{
"id": 30,
"title" : "Research"
},
{
"id": 45,
"title" : "Events"
},
{
"id" : 52,
"title" : "International"
}
]
Becomes:
"collections.id": [30,45,52],
"collections.title": [Research, Events, International]
Elasticsearch doesn't know id 30 belongs to Research, or id 45 to Events.
You must use "nested" type to keep the relation between nested properties.
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
Solution: Use nested field type
Mappings
PUT test_nestedaggs
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"collections": {
"type": "nested",
"properties": {
"title": {
"type": "keyword"
},
"id": {
"type": "keyword"
}
}
}
}
}
}
Documents
POST test_nestedaggs/_doc
{
"name": "Document 1",
"collections": [
{
"id": 30,
"title": "Research"
},
{
"id": 45,
"title": "Events"
},
{
"id": 52,
"title": "International"
}
]
}
POST test_nestedaggs/_doc
{
"name": "Document 2",
"collections": [
{
"id": 45,
"title": "Events"
},
{
"id": 63,
"title": "Development"
}
]
}
Query
POST test_nestedaggs/_search?size=0
{
"aggs": {
"nested_collections": {
"nested": {
"path": "collections"
},
"aggs": {
"collections": {
"terms": {
"field": "collections.title"
},
"aggs": {
"ids": {
"terms": {
"field": "collections.id"
}
}
}
}
}
}
}
}
Results
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"nested_collections": {
"doc_count": 5,
"collections": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Events",
"doc_count": 2,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "45",
"doc_count": 2
}
]
}
},
{
"key": "Development",
"doc_count": 1,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "63",
"doc_count": 1
}
]
}
},
{
"key": "International",
"doc_count": 1,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "52",
"doc_count": 1
}
]
}
},
{
"key": "Research",
"doc_count": 1,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "30",
"doc_count": 1
}
]
}
}
]
}
}
}
}
You can read an article I wrote about that for details:
https://opster.com/guides/elasticsearch/data-architecture/elasticsearch-nested-field-object-field/
NOTE: If the number of child documents gets large and you do a lot of updates, consider changing the data model, because each child document is an independent document in the index and every update to a child document reindexes the whole structure, which may affect performance. There are also limits on the maximum number of nested documents you can add. If the number is small, as in the example, then it's fine.
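Regarding those limits: the number of nested objects a single document may contain is governed by the index.mapping.nested_objects.limit index setting, which defaults to 10,000 in recent Elasticsearch versions. As a minimal sketch (purely for illustration, the small example above does not need it), the same mapping could be created with the limit set explicitly:
PUT test_nestedaggs
{
  "settings": {
    "index.mapping.nested_objects.limit": 10000
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "collections": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "keyword"
          },
          "id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}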
I don't know if it is possible to return additional fields in the response for each bucket.
The current request returns correct results, but I'm missing additional field information required for later processing.
{
"query": {
"bool": {
"must": {
"match_all": {}
}
}
},
"track_total_hits": true,
"from": 0,
"size": 0,
"aggs": {
"strings": {
"nested": {
"path": "filter_data.string_facet"
},
"aggs": {
"names": {
"terms": {
"field": "filter_data.string_facet.facet-name"
},
"aggs": {
"values": {
"terms": {
"field": "filter_data.string_facet.facet-value"
}
}
}
}
}
}
}
}
Here is the result. Note the filter_data field and how the nested fields are structured.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [{
"_index": "my_index",
"_type": "_doc",
"_id": "7000043",
"_score": 1,
"_source": {
"item_data": {
"doc_id": 7000043,
"id": 7000043,
"live_state": 1,
"item_sku": "7000043",
"manufacturer_id": 1394
},
"filter_data": {
"string_facet": [{
"facet-name": "Thread size",
"facet-value": "G1/2",
"facet-name-id": 12,
"facet-value-id": 34
}]
}
}
}]
},
"aggregations": {
"strings": {
"doc_count": 5,
"names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "Thread size",
"doc_count": 2,
"values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "G1 1/4",
"doc_count": 1
}, {
"key": "G1/2",
"doc_count": 1
}]
}
}]
}
}
}
Is it possible to add additional fields to each bucket? It would be ideal to have a format like the one below in the response, basically adding the fields facet-name-id and facet-value-id to each bucket.
....
"buckets": [{
"key": "Thread size",
"doc_count": 2,
"values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "G1 1/4",
"facet-name-id": 12,
"facet-value-id": 34
"doc_count": 1
}, {
"key": "G1/2",
"facet-name-id": 12,
"facet-value-id": 35
"doc_count": 1
}]
}
}]
...
If this is not possible, what would you recommend?
Thanx.
Sure, you can use top_hits as a sub-aggregation of your deepest facet-value aggregation:
POST my_index/_search?filter_path=aggregations.*.*.buckets.key,aggregations.*.*.buckets.values.buckets.key,aggregations.*.*.buckets.values.buckets.*.hits.hits._source
{
"query": {
"bool": {
"must": {
"match_all": {}
}
}
},
"track_total_hits": true,
"from": 0,
"size": 0,
"aggs": {
"strings": {
"nested": {
"path": "filter_data.string_facet"
},
"aggs": {
"names": {
"terms": {
"field": "filter_data.string_facet.facet-name"
},
"aggs": {
"values": {
"terms": {
"field": "filter_data.string_facet.facet-value"
},
"aggs": {
"my_top_hits": {
"top_hits": {
"size": 10,
"_source": ["filter_data.string_facet"]
}
}
}
}
}
}
}
}
}
}
which'd yield:
{
"aggregations" : {
"strings" : {
"names" : {
"buckets" : [
{
"key" : "Thread size",
"values" : {
"buckets" : [
{
"key" : "G1/2",
"my_top_hits" : {
"hits" : {
"hits" : [
{
"_source" : {
"facet-value" : "G1/2",
"facet-name" : "Thread size",
"facet-value-id" : 34,
"facet-name-id" : 12
}
}
]
}
}
}
]
}
}
]
}
}
}
}
Notice that my_top_hits is an array of string_facet objects instead of an object as you requested. That's because although you're already 2 facets deep (facet-name and then facet-value), there may still be multiple different facet-value-id and facet-name-id combinations covered by a given facet-value bucket.
Having said that, you can of course limit the top_hits count with the size parameter, but then you wouldn't be able to say with certainty whether or not the first top hit's facets are representative of the whole bucket.
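For completeness, only the innermost block changes if you do that. Here is a sketch of the same request with size set to 1, under the assumption that one representative facet-name-id/facet-value-id pair per facet-value bucket is enough for your later processing:
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "strings": {
      "nested": {
        "path": "filter_data.string_facet"
      },
      "aggs": {
        "names": {
          "terms": {
            "field": "filter_data.string_facet.facet-name"
          },
          "aggs": {
            "values": {
              "terms": {
                "field": "filter_data.string_facet.facet-value"
              },
              "aggs": {
                "my_top_hits": {
                  "top_hits": {
                    "size": 1,
                    "_source": ["filter_data.string_facet"]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
This keeps the response close to the shape you asked for, at the cost of silently dropping any additional id combinations inside a bucket.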
I have the below query to fetch aggregations using Elasticsearch 7.1.
{
"query": {
"bool": {
"filter": [
{
"bool": {
"must": [
{
"match": {
"viewedInFeed": true
}
}
]
}
}
]
}
},
"size": 0,
"aggs": {
"viewed_in_feed_by_day": {
"date_histogram": {
"field": "createdDate",
"interval" : "day",
"format" : "yyyy-MM-dd",
"min_doc_count": 1
}
}
}
}
The results are greater than 10,000 and I am not sure how to proceed, since scroll is not available for aggregations. See the response below.
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"viewed_in_feed_by_day": {
"buckets": [
{
"key_as_string": "2020-03-19",
"key": 1584576000000,
"doc_count": 3028
},
{
"key_as_string": "2020-03-20",
"key": 1584662400000,
"doc_count": 5384
},
{
"key_as_string": "2020-03-21",
"key": 1584748800000,
"doc_count": 3521
}
]
}
}
}
When using _count the document count is greater than 10,000, and even without "min_doc_count": 1 the query doesn't return more results; I know there is more data anyway.
Building on top of Jaspreet's comments, I suggest the following:
Use track_total_hits=true to get the exact counts (available since 7.0) while keeping size=0 to only aggregate.
Use the stats aggregation to gain more insights before running your histograms.
GET dates/_search
{
"track_total_hits": true,
"size": 0,
"aggs": {
"dates_insights": {
"stats": {
"field": "createdDate"
}
},
"viewed_in_feed_by_day": {
"date_histogram": {
"field": "createdDate",
"interval" : "month",
"format" : "yyyy-MM-dd",
"min_doc_count": 1
}
}
}
}
yielding
...
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"viewed_in_feed_by_day" : {
"buckets" : [
{
"key_as_string" : "2020-01-01",
"key" : 1577836800000,
"doc_count" : 1
},
{
"key_as_string" : "2020-02-01",
"key" : 1580515200000,
"doc_count" : 1
},
{
"key_as_string" : "2020-03-01",
"key" : 1583020800000,
"doc_count" : 1
}
]
},
"dates_insights" : {
"count" : 3,
...
"min_as_string" : "2020-01-22T13:09:21.588Z",
"max_as_string" : "2020-03-22T13:09:21.588Z",
...
}
}
...
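Side note: the 10,000 cap applies to hits, not to aggregation buckets, so the date_histogram itself is not truncated. If you ever do need to page through a very large number of buckets (the aggregation counterpart of scrolling), the usual tool is a composite aggregation. Here is a minimal sketch, assuming the same createdDate field; the "interval" parameter matches the 7.1 syntax used above, while newer versions prefer calendar_interval:
GET dates/_search
{
  "size": 0,
  "aggs": {
    "viewed_in_feed_by_day": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "day": {
              "date_histogram": {
                "field": "createdDate",
                "interval": "day",
                "format": "yyyy-MM-dd"
              }
            }
          }
        ]
      }
    }
  }
}
Each response contains an after_key; pass it back in an "after" parameter on the next request to fetch the next page of buckets.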
I'm trying to build a product search with facet filtering for an eCommerce app. For the product brand I have the following structure:
"brand": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"id": {
"type": "integer"
}
}
}
I want to make an aggregation by brand id and return the whole object and the count of the documents. Something like this:
"brands" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : {
"name": "Apple",
"id": 1
},
"doc_count" : 34
},
{
"key" : {
"name": "Samsung",
"id": 2
},
"doc_count" : 23
}
]
}
Currently I'm writing the aggregation like this:
"aggs": {
"brands": {
"nested": {
"path": "brand"
},
"aggs": {
"brandIds": {
"terms": {
"field": "brand.id"
}
}
}
}
}
and the result looks like this:
"aggregations" : {
"brands" : {
"doc_count" : 15,
"brandIds" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 4
},
{
"key" : 2,
"doc_count" : 2
}
]
}
}
}
You can use a terms aggregation within another terms aggregation, like this:
GET {index_name}/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"brands": {
"nested": {
"path": "brand"
},
"aggs": {
"brandIds": {
"terms": {
"field": "brand.id"
},
"aggs": {
"by name": {
"terms": {
"field": "brand.name.keyword",
"size": 10
}
}
}
}
}
}
}
}
This would result in something like this:
"aggregations": {
"brands": {
"doc_count": 68,
"brandIds": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 46,
"by name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Apple",
"doc_count": 46
}
]
}
},
{
"key": 2,
"doc_count": 22,
"ny id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Samsung",
"doc_count": 22
}
]
}
}
]
}
}
}
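If you'd rather get the whole brand object back per bucket instead of adding a second terms level, another option is a top_hits sub-aggregation under brandIds. The following is only a sketch against the mapping you've shown; it returns the first matching nested brand document for each id:
GET {index_name}/_search
{
  "size": 0,
  "aggs": {
    "brands": {
      "nested": {
        "path": "brand"
      },
      "aggs": {
        "brandIds": {
          "terms": {
            "field": "brand.id"
          },
          "aggs": {
            "brand_object": {
              "top_hits": {
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}
The doc_count of each brandIds bucket is the count you're after, and the single top hit carries the full brand source (id and name) alongside it.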
Hope this helps!!
I am using Elasticsearch 1.7 and I need to apply a must_not filter based on aggregation key values.
Below is the structure:
{"RU": "2016-06-25T15:07:46.144","zt":"bl","zi":"z101"}
{"RU": "2016-06-25T15:07:46.144","zt":"bl","zi":"z102"}
{"RU": "2016-06-25T15:07:46.144","zt":"bl","zi":"z103"}
{"RU": "2016-06-25T15:07:46.144","zt":"un","zi":"z201"}
{"RU": "2016-06-25T15:07:46.144","zt":"un","zi":"z202"}
{"RU": "2016-06-25T15:07:46.144","zt":"g1","zi":"z101"}
{"RU": "2016-06-25T15:07:46.144","zt":"g1","zi":"z502"}
{"RU": "2016-06-25T15:07:46.144","zt":"g2","zi":"z201"}
{"RU": "2016-06-25T15:07:46.144","zt":"g2","zi":"z503"}
My query:
{"size": 0,
"aggs": {
"findunique": {
"filter": {
"bool": {
"must_not": [
{
"terms": {
"zt": [
"bl",
"un"
]
}
}
],
"must": [
{
"terms": {
"zt": [
"g1",
"g2"
]
}
}
]
}
},
"aggs": {
"uniquezi": {
"terms": {
"field": "zi"
}
}
}
}
}
}
Output:
{"aggregations": {
"findunique": {
"doc_count": 4,
"uniquezi": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "z101",
"doc_count": 1
},
{
"key": "z201",
"doc_count": 1
},
{
"key": "z502",
"doc_count": 1
},
{
"key": "z503",
"doc_count": 1
}
]
}
}
}
}
Now I am looking to not show zi = z101, and z201 should not come in the list either, as those belong to zt = bl and zt = un.
Please suggest a way to do this. Thanks!
As a suggestion, you could try adding two aggregations with filters on the "zt" field.
This way you will get two sets, and can later extract, in code, all values from "wanted" which are not in "unwanted".
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"wanted" : { "terms" : { "zt" : [ "g1", "g2" ] }},
"unwanted" : { "terms" : { "zt" : [ "bl", "un" ] }}
}
},
"aggs" : {
"monthly" : {
"terms": {"field" : "zi"}
}
}
}
}
}
Response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"messages": {
"buckets": {
"wanted": {
"doc_count": 4,
"distinctValuesAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "z101",
"doc_count": 1
},
{
"key": "z201",
"doc_count": 1
},
{
"key": "z502",
"doc_count": 1
},
{
"key": "z503",
"doc_count": 1
}
]
}
},
"unwanted": {
"doc_count": 5,
"distinctValuesAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "z101",
"doc_count": 1
},
{
"key": "z102",
"doc_count": 1
},
{
"key": "z103",
"doc_count": 1
},
{
"key": "z201",
"doc_count": 1
},
{
"key": "z202",
"doc_count": 1
}
]
}
}
}
}
}
}