Get count of distinct values for a field across all documents in Elasticsearch

I have a field
slices.code
in my Elasticsearch mapping. slices is an array element and slices.code has various values like "ATF", "ORW", "HKL". slices is not a nested type field, and I want to avoid making it nested. In each document there can be multiple occurrences of slices.code = ATF/ORW. So I want to get all possible values of slices.code along with the total number of occurrences of each value across all the documents. Something like this, where HKL appeared in 2 documents but 3 times in total:
{
  "key": "HKL",
  "doc_count": 2,
  "total": {
    "value": 3
  }
},
{
  "key": "ATF",
  "doc_count": 3,
  "total": {
    "value": 7
  }
},
{
  "key": "ORW",
  "doc_count": 2,
  "total": {
    "value": 5
  }
}
I tried using a terms aggregation, but with that I only get doc_count; I don't get the total number of occurrences of each field value. Below is the terms aggregation that I tried:
{
  "size": 0,
  "aggs": {
    "distinct_colors": {
      "terms": {
        "field": "slices.code.keyword",
        "size": 65535
      }
    }
  }
}
Output that I received:
"buckets": [
{
"key": "HKG",
"doc_count": 1
},
{
"key": "MNL",
"doc_count": 1
},
{
"key": "PVG",
"doc_count": 1
},
{
"key": "TPE",
"doc_count": 1
}
]
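One possible direction (an untested sketch, not a verified solution for non-nested arrays) is to add a value_count sub-aggregation on the same field, named total to match the desired output above. Be aware that because slices is not nested, value_count counts every slices.code value in each bucket's documents, so it only equals the exact per-key total when a document's codes are all the same; an exact per-key occurrence count generally requires a nested mapping.
{
  "size": 0,
  "aggs": {
    "distinct_colors": {
      "terms": {
        "field": "slices.code.keyword",
        "size": 65535
      },
      "aggs": {
        "total": {
          "value_count": {
            "field": "slices.code.keyword"
          }
        }
      }
    }
  }
}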

Related

Elasticsearch - Sort results of Terms aggregation by key string length

I am querying ES with a terms aggregation to find the first N unique values of a string field foo, where the field contains the substring bar and the document matches some other constraints.
Currently I am able to sort the results alphabetically by key:
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": {"_key": "asc"},
        "size": N
      }
    }
  }
}
This gives results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        },
        {
          "key": "z_bar_z",
          "doc_count": 1
        }
      ]
    }
  }
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, so that the results are like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        }
      ]
    }
  }
}
This is desired because a shorter string is closer to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to occur in ES so that I only have to load the top N results from ES.
I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field.
Then I was able to sort by this new length field first, then by the actual key so keys of the same length are sorted alphabetically.
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": [
          {"key_length": "asc"},
          {"_key": "asc"}
        ],
        "size": N
      },
      "aggs": {
        "key_length": {
          "max": {"script": "doc['foo.raw'].value.length()"}
        }
      }
    }
  }
}
This gave me results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "dd_bar_dd",
          "doc_count": 1
        },
        {
          "key": "bbb_bar_bbb",
          "doc_count": 1
        }
      ]
    }
  }
}
which is what I wanted.
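As a side note, on Elasticsearch 5.x and later the same length script can also be written in the explicit Painless object form; a minimal sketch, assuming the same foo.raw field:
"aggs": {
  "key_length": {
    "max": {
      "script": {
        "lang": "painless",
        "source": "doc['foo.raw'].value.length()"
      }
    }
  }
}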

Query on result aggregation in Elasticsearch

I have imported millions of records into Elasticsearch. The mapping is as follows:
"_source": {
"mt": "w",
"hour": 1
}
I want to find the number of hours that have occurred more than 5 times.
For example, using a terms aggregation I get the following result:
"aggregations": {
"hours": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 7
},
{
"key": 4,
"doc_count": 5
},
{
"key": 5,
"doc_count": 2
}
]
}
}
How do I find the count of hours that occur more than 5 times?
Here it would be 1, because only hour=1 occurs more than 5 times.
You can use "min_doc_count" in the terms aggregation (Elastic doc). Note that the threshold is inclusive, so for strictly more than 5 occurrences you need "min_doc_count": 6, as sketched below.
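A minimal sketch of that suggestion, using the hour field from the mapping above:
{
  "size": 0,
  "aggs": {
    "hours": {
      "terms": {
        "field": "hour",
        "min_doc_count": 6
      }
    }
  }
}
The number of qualifying hours is then simply the number of buckets returned.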

Elastic Search Aggregation buckets, buckets by number of records

I am new to Elastic Search and I'm trying to create a request without a lot of success. Here is the use case:
Let's imagine I have 4 documents, which have an amount field:
[
  {
    "id": 541436748332,
    "amount": 5,
    "date": "2017-01-01"
  },
  {
    "id": 6348643512,
    "amount": 2,
    "date": "2017-03-13"
  },
  {
    "id": 343687432,
    "amount": 2,
    "date": "2017-03-14"
  },
  {
    "id": 6457866181,
    "amount": 7,
    "date": "2017-05-21"
  }
]
And here is the kind of result I'd like to get:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 9
        }
      ]
    }
  }
}
As you can see, I want some kind of histogram, but instead of specifying a date interval, I'd like to set a "document" interval. So here, that would be 2 documents per bucket, and the sum of the amount field of those two documents.
Does someone know if that is even possible? That would also imply sorting the records by date, for example, to get the wanted results.
EDIT: Some more explanations on the use case:
The real use case is a line graph I'd like to plot. I want the X axis to be the number of sales, and the Y axis the total amount of those sales. And I don't want to plot thousands of dots on my graph; I want fewer dots, which is why I was hoping to deal with the buckets and the sums.
The example response I gave is just the first step I want to achieve; the second step would be to add to each bucket the sum of the buckets before it:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 16
        }
      ]
    }
  }
}
(7 = 5 + 2); (16 = 7 (from last result) + 2 + 7);
You can use histogram and sum aggregations, like this:
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "field": "amount"
          }
        }
      }
    }
  }
}
(offset 1 is required if you want the first bucket to start at 1 instead of at 0.) Then you'll get a response like this:
{
  "aggregations": {
    "prices": {
      "buckets": [
        {
          "key": 1,
          "doc_count": 2,
          "total_amount": {
            "value": 7
          }
        },
        {
          "key": 3,
          "doc_count": 2,
          "total_amount": {
            "value": 9
          }
        }
      ]
    }
  }
}
Sorting is not required, because the default order is the order you want. However, there's also an order parameter in case you want a different ordering of the buckets.
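For the cumulative totals described in the question's edit, a cumulative_sum pipeline aggregation (available since Elasticsearch 2.0) can be layered onto the same histogram. A sketch under the same assumptions as the query above, where cumulative_amount is an illustrative name:
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "field": "amount"
          }
        },
        "cumulative_amount": {
          "cumulative_sum": {
            "buckets_path": "total_amount"
          }
        }
      }
    }
  }
}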

Histogram aggregation OR something else?

Which aggregation should I use when I want the same functionality as histogram, but specifying only the number of buckets instead of the interval?
Something like: give me aggs for price, split into 5 buckets...
I don't want to run a min+max aggregation and then calculate the 5 intervals before sending my query, because that means one extra round trip to the server: first ask for min+max, then send the actual query.
STANDARD HISTOGRAM AGGS QUERY:
"aggs":{
"prices":{
"histogram": {
"field": "variants.priceVat.d1",
"interval": 500
}
}
}
STANDARD RESULT (min 10, max 850 = 2 buckets, because interval is 500):
"prices": {
"doc_count": 67,
"prices": {
"buckets": [
{
"key": 10,
"doc_count": 56
},
{
"key": 500,
"doc_count": 13
}
]
}
}
WHAT I WANT (five buckets with automatic range min: 10, max: 850 = bucket interval of 168):
"prices": {
"doc_count": 67,
"prices":{
"buckets": [
{
"key": 10,
"doc_count": 42
},
{
"key": 178,
"doc_count": 10
},
{
"key": 346,
"doc_count": 4
},
{
"key": 514,
"doc_count": 7
},
{
"key": 682,
"doc_count": 2
}
]
}
}
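For what it's worth, newer Elasticsearch releases (7.9+) ship a variable_width_histogram aggregation that accepts a target bucket count directly, although its bucket boundaries are derived from the data rather than evenly spaced, so the output will not exactly match the uniform intervals shown above. A sketch on the same field:
"aggs": {
  "prices": {
    "variable_width_histogram": {
      "field": "variants.priceVat.d1",
      "buckets": 5
    }
  }
}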

How to get the count of most frequent pattern in elasticsearch?

I want to get the ten most frequent patterns from a search with Elasticsearch.
Example:
"cgn:4189, dfsdkfldslfs"
"cgn:4210, aezfvdsvgds"
"cgn:4189, fdsmpfjdjs"
"cgn:4195, cvsf"
"cgn:4189, mkpjd"
"cgn:4210, mfsfgkpjd"
I want to get:
4189 : 3
4210 : 2
4195 : 1
I know how to do that in MySQL or via awk/sort/head, but with Elasticsearch I'm lost.
Exactly how it will work depends on your analyzer, but if you are just using the default, standard analyzer, you can probably get what you want pretty easily with a terms aggregation.
As a simple example, I set up a trivial index:
PUT /test_index
{
  "settings": {
    "number_of_shards": 1
  }
}
Then I indexed the data you posted, using the bulk API:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"msg":"cgn:4189, dfsdkfldslfs"}
{"index":{"_id":2}}
{"msg":"cgn:4210, aezfvdsvgds"}
{"index":{"_id":3}}
{"msg":"cgn:4189, fdsmpfjdjs"}
{"index":{"_id":4}}
{"msg":"cgn:4195, cvsf"}
{"index":{"_id":5}}
{"msg":"cgn:4189, mkpjd"}
{"index":{"_id":6}}
{"msg":"cgn:4210, mfsfgkpjd"}
Then I can run a simple terms aggregation to get back all the terms and how often they occur (ordered by descending doc count by default):
POST /test_index/_search?search_type=count
{
"aggs": {
"msg_terms": {
"terms": {
"field": "msg"
}
}
}
}
which returns:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "msg_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "cgn",
          "doc_count": 6
        },
        {
          "key": "4189",
          "doc_count": 3
        },
        {
          "key": "4210",
          "doc_count": 2
        },
        {
          "key": "4195",
          "doc_count": 1
        },
        {
          "key": "aezfvdsvgds",
          "doc_count": 1
        },
        {
          "key": "cvsf",
          "doc_count": 1
        },
        {
          "key": "dfsdkfldslfs",
          "doc_count": 1
        },
        {
          "key": "fdsmpfjdjs",
          "doc_count": 1
        },
        {
          "key": "mfsfgkpjd",
          "doc_count": 1
        },
        {
          "key": "mkpjd",
          "doc_count": 1
        }
      ]
    }
  }
}
Here is the code I used:
http://sense.qbox.io/gist/a827095b675596c4e3d545ce963cde3fae932156
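If only the numeric codes are wanted (dropping cgn and the random right-hand tokens), the terms aggregation's include parameter can filter bucket keys by regular expression; a sketch along the same lines as the query above:
POST /test_index/_search?search_type=count
{
  "aggs": {
    "msg_terms": {
      "terms": {
        "field": "msg",
        "include": "[0-9]+"
      }
    }
  }
}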