How can I get the count of elements when using distinct on a field in Elasticsearch? I want to get the total number of elements in an index, distinct on one of the fields. I can use this query for search:
**POST myIndex/_search**
{
"size": 0,
"aggs": {
"myField": {
"terms": {
"field": "name’s of my field",
"size": 10000
}
}
}
.
.
.
}
But I want a query similar to:
**GET myIndex/_count**
{
"size": 0,
"aggs": {
"myField": {
"terms": {
"field": "name’s of my field",
"size": 10000
}
}
}
.
.
.
}
But it returns this error:
{
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "request does not support [size]",
"line" : 2,
"col" : 3
}
],
"type" : "parsing_exception",
"reason" : "request does not support [size]",
"line" : 2,
"col" : 3
},
"status" : 400
}
So I'm interested in a solution to this problem.
Elasticsearch only supports approximate distinct counts, using the cardinality aggregation:
{
"aggs": {
"distinct_count": {
"cardinality": {
"field": "field-name"
}
}
}
}
The returned values are approximate, though you can increase accuracy using the precision_threshold option.
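A sketch of the same aggregation with precision_threshold set explicitly (counts up to the threshold are close to exact; the maximum supported value is 40000, traded against memory usage):

```json
POST myIndex/_search
{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "field-name",
        "precision_threshold": 40000
      }
    }
  }
}
```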
We are using Elasticsearch 7.x, and I'm trying to take a sample. The query matches far more than 10,000 results, which is the maximum number of hits a query can return. In order to paginate with search_after, I need to sort the items by #timestamp (sorting on _id will be deprecated soon).
Here's my current query:
GET /my-index-pattern/_search
{
"query": {
"range": {
"#timestamp": {
"gte": "now-1M",
"lte": "now"
}
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 40000
},
"aggs": {
"group_by_my_grouping_field": {
"terms": {
"field": "my_grouping_field.keyword",
"size": 10000
}
}
}
}
},
"sort": [
"#timestamp"
]
}
Returning:
"_shards" : {
"total" : 55,
"successful" : 55,
"skipped" : 43,
"failed" : 0
},
However, this takes a long time. I think it's sorting before doing the sample, which also affects my methodology. It also seems to be skipping shards?
Is there a way to sort within the sample?
I tried:
...
"sample": {
"sampler": {
"shard_size": 40000
},
"aggs": {
"group_by_my_grouping_field": {
"terms": {
"field": "my_grouping_field.keyword",
"size": 10000
}
},
"search_after_sort":
{
"bucket_sort": {
"sort": ["#timestamp"]
}
}
}
}
...
But this just gives:
"error" : {
"root_cause" : [
{
"type" : "action_request_validation_exception",
"reason" : "Validation Failed: 1: No aggregation found for path [#timestamp];"
}
],
"type" : "action_request_validation_exception",
"reason" : "Validation Failed: 1: No aggregation found for path [#timestamp];"
},
"status" : 400
This happens for all fields, such as message and _id, not just #timestamp.
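The validation message suggests that bucket_sort resolves its sort entries as paths to sibling aggregations, not as document fields, which is why every raw field name fails. A hedged sketch of the usual workaround, assuming the goal is to order the term buckets by their latest timestamp (latest_ts is a hypothetical helper max sub-aggregation, not from the original query):

```json
"aggs": {
  "sample": {
    "sampler": { "shard_size": 40000 },
    "aggs": {
      "group_by_my_grouping_field": {
        "terms": {
          "field": "my_grouping_field.keyword",
          "size": 10000
        },
        "aggs": {
          "latest_ts": { "max": { "field": "#timestamp" } },
          "search_after_sort": {
            "bucket_sort": {
              "sort": [ { "latest_ts": { "order": "desc" } } ]
            }
          }
        }
      }
    }
  }
}
```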
I've got the following Elasticsearch query to get the number of product sales per hour, grouped by product ID and hour of sale.
POST /my_sales/_search?size=0
{
"aggs": {
"sales_per_hour": {
"date_histogram": {
"field": "event_time",
"fixed_interval": "1h",
"format": "yyyy-MM-dd:HH:mm"
},
"aggs": {
"sales_per_hour_per_product": {
"terms": {
"field": "name.keyword"
}
}
}
}
}
}
One example document:
{
"#timestamp" : "2020-10-29T18:09:56.921Z",
"name" : "my-beautifull_product",
"event_time" : "2020-10-17T08:01:33.397Z"
}
This query returns several buckets (one per hour and per product), but I would like to retrieve only those with a doc_count higher than 10, for example. Is that possible?
For those results I would like to know the ID of the product and the event_time bucket.
Thanks for your help.
Perhaps the Bucket Selector aggregation will help filter the results.
Try out the search query below:
{
"aggs": {
"sales_per_hour": {
"date_histogram": {
"field": "event_time",
"fixed_interval": "1h",
"format": "yyyy-MM-dd:HH:mm"
},
"aggs": {
"sales_per_hour_per_product": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"the_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count > 10"
}
}
}
}
}
}
}
}
It will keep only the buckets whose document count is greater than 10, based on the script "params.the_doc_count > 10", and filter out the rest.
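As a side note, for a plain document-count threshold on a terms aggregation the script may not be needed at all: terms has a built-in min_doc_count option (here 11, i.e. doc_count > 10), applied within each parent date_histogram bucket. A sketch of just that sub-aggregation:

```json
"aggs": {
  "sales_per_hour_per_product": {
    "terms": {
      "field": "name.keyword",
      "min_doc_count": 11
    }
  }
}
```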
Thank you for your help. This is close to what I would like, but not exactly it. With the bucket selector I have something like this:
"aggregations" : {
"sales_per_hour" : {
"buckets" : [
{
"key_as_string" : "2020-08-31:23:00",
"key" : 1598914800000,
"doc_count" : 16,
"sales_per_hour_per_product" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "my_product_1",
"doc_count" : 2
},
{
"key" : "my_product_2",
"doc_count" : 2
},
{
"key" : "myproduct_3",
"doc_count" : 12
}
]
}
}
]
}
}
And sometimes none of the buckets is greater than 10. Is it possible to have the same thing, but with the filter on _count applied to the second-level aggregation (sales_per_hour_per_product) and not to the first level (sales_per_hour)?
I am new to Elasticsearch and am trying to make a query with a metric aggregation for my docs. But when I add the field min_doc_count=1 to my sum metric aggregation, I get an error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "[sum] unknown field [min_doc_count], parser not found"
}
],
"type": "illegal_argument_exception",
"reason": "[sum] unknown field [min_doc_count], parser not found"
},
"status": 400
}
What am I missing here?
{
"aggregations" : {
"myKey" : {
"sum" : {
"field" : "field1",
"min_doc_count": 1
}
}
}
}
I'm not sure why you have min_doc_count under the sum keyword: min_doc_count is an option of bucket aggregations such as terms, not of metric aggregations such as sum.
The idea of min_doc_count is to make sure the buckets returned by a given aggs query contain at least N documents; the example below would only return subject buckets for subjects that appear in 10 or more documents.
GET _search
{
"aggs" : {
"docs_per_subject" : {
"terms" : {
"field" : "subject",
"min_doc_count": 10
}
}
}
}
So with that in mind, yours would refactor to the following. Although, when setting min_doc_count to 1, it's not really necessary to keep the parameter at all, since 1 is already the default for terms.
GET _search
{
"aggs" : {
"docs_per_subject" : {
"terms" : {
"field" : "field1",
"min_doc_count": 1
}
}
}
}
If you wish to sum only the non-zero values of a field, you can filter the zero values out in the query section:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"field": {
"gt": 0
}
}
}
]
}
},
"aggregations": {
"myKey": {
"sum": {
"field": "field1"
}
}
}
}
See the Bool Query and Range Query documentation.
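Note that range with gt: 0 also drops negative values. If the field can be negative and only exact zeros should be excluded, a must_not term filter is a sketch of an alternative (this assumes the field being summed and filtered is the same field1; the query above filters on "field"):

```json
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": [
        { "term": { "field1": 0 } }
      ]
    }
  },
  "aggregations": {
    "myKey": {
      "sum": { "field": "field1" }
    }
  }
}
```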
I have an index of messages where I also store a messageHash for each message, along with many other fields. There are multiple documents with duplicate message fields in the index, e.g. "Hello". I want to retrieve unique messages.
Here is the query I wrote to search unique messages and sort them by date; I want the message with the latest date among all duplicates to be returned.
{
"query": {
"bool": {
"must": {
"match_phrase": {
"message": "Hello"
}
}
}
},
"sort": [
{
"date": {
"order": "desc"
}
}
],
"aggs": {
"top_messages": {
"terms": {
"field": "messageHash"
},
"aggs": {
"top_messages_hits": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
},
"_score"
],
"size": 1
}
}
}
}
}
}
The problem is that the results are not sorted by date; they are sorted by doc_count. I just get the sort values in the response, not actually sorted results. What's wrong? I'm now wondering if it's even possible to do this.
EDIT:
I tried substituting "terms" : { "field" : "messageHash", "order" : { "mydate" : "desc" } } , "aggs" : { "mydate" : { "max" : { "field" : "date" } } } for "terms": { "field": "messageHash" }, but I get:
{
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "Found two sub aggregation definitions under [top_messages]",
"line" : 1,
"col" : 412
}
],
"type" : "parsing_exception",
"reason" : "Found two sub aggregation definitions under [top_messages]",
"line" : 1,
"col" : 412
},
"status" : 400
}
I want to convert the following SQL query to an Elasticsearch one. Can anyone help with this?
select csgg, sum(amount) from table1
where type in ('a','b','c') and year=2016 and fc="33" group by csgg having sum(amount)=0
I tried the following way:
{
"size": 500,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{"term" : {"fc" : "33"}},
{"term" : {"year" : 2016}}
],
"should" : [
{"terms" : {"type" : ["a","b","c"] }}
]
}
}
}
},
"aggs": {
"group_by_csgg": {
"terms": {
"field": "csgg"
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
}
}
}
}
}
But I'm not sure if I am doing it right, as it isn't validating against the expected results. It seems the HAVING part needs to be added inside the aggregation.
Assuming that you use Elasticsearch 2.x, there is a way to get HAVING semantics in Elasticsearch. (I'm not aware of a way prior to 2.0.)
You can use the new Bucket Selector pipeline aggregation, which only selects the buckets that meet a certain criterion:
POST test/test/_search
{
"size": 0,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{"term" : {"fc" : "33"}},
{"term" : {"year" : 2016}},
{"terms" : {"type" : ["a","b","c"] }}
]
}
}
}
},
"aggs": {
"group_by_csgg": {
"terms": {
"field": "csgg",
"size": 100
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
},
"no_amount_filter": {
"bucket_selector": {
"buckets_path": {"sumAmount": "sum_amount"},
"script": "sumAmount == 0"
}
}
}
}
}
}
However, there are two caveats. Depending on your configuration, it might be necessary to enable scripting like this:
script.aggs: true
script.groovy: true
Moreover, since the bucket_selector works on the buckets returned by the parent terms aggregation, it is not guaranteed that you get all buckets with sum(amount) = 0: if the terms aggregation happens to return only terms whose sum of amount is != 0, you will have no results.
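For reference, on Elasticsearch 5.0 and later the default scripting language is Painless and no scripting flags need to be enabled for inline aggregation scripts; the bucket_selector script then reads its buckets_path values through params, so the same filter would look like:

```json
"no_amount_filter": {
  "bucket_selector": {
    "buckets_path": { "sumAmount": "sum_amount" },
    "script": "params.sumAmount == 0"
  }
}
```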