I have two fields in the search document such as salary_from and salary_to and want the aggregation of salary ranges such as 0 - 10 , 10 - 20, etc.
Is there any ways to set multiple fields to the Elastic Range Aggregation. (I can set one field by using setField function)
I just need to get the aggregated count of salary ranges or slabs by considering the two fields salary_from and salary_to.
Please help me.
If I understand your question correctly, below is what you need.
{
"size": 0,
"aggs": {
"salary_ranges": {
"terms": {
"script": "doc['salary_from'].value + ' to ' + doc['salary_to'].value",
"size": 0
}
}
}
}
It basically uses a script for Terms Aggregation. Read more about it here.
If say, you have 3 documents with salary_from set to 3 and salary_to set to 5 and then you have 2 documents with salary_from set to 10 and salary_to set to 25, the result of the query above will look something like below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"aggregations": {
"salary_ranges": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "3 to 5",
"doc_count": 3
},
{
"key": "10 to 25",
"doc_count": 2
}
]
}
}
}
Related
I have stored some values in Elasticsearch nested data type (an array) but without using key/value pair. An example record would be:
{
"categories": [
"Category1",
"Category2"
],
"product_name": "productx"
}
Now I want to run aggregation query to find out unique list of categories available. But all the examples I've seen pointed to mapping that has key/value. Is there any way I can use above schema as is or do I need to change my schema to something like this to run aggregation query
{
"categories": [
{"name": "Category1"},
{"name": "Category2"}
],
"product_name": "productx"
}
Well regarding JSON structure, you need to take a step back and figure out if you'd want list or key-value pairs.
Looking at your example, I don't think you need key-value pairs but again its something you may want to clarify by understanding your domain if there'd be some more properties for categories.
Regarding aggregation, as far as I know, aggregations would work on any valid JSON structure.
For the data you've mentioned, you can make use of the below aggregation query. Also I'm assuming the fields are of type keyword.
Aggregation Query
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"myaggs": {
"terms": {
"size": 100,
"script": {
"inline": """
def myString = "";
def list = new ArrayList();
for(int i=0; i<doc['categories'].length; i++){
myString = doc['categories'][i] + ", " + doc['product'].value;
list.add(myString);
}
return list;
"""
}
}
}
}
}
Aggregation Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category1, productx",
"doc_count": 1
},
{
"key": "category2, productx",
"doc_count": 1
}
]
}
}
}
Hope it helps!
I have an ElasticSearch index full of legacy log data that I want to bucket by hour to get an idea of when the most active times were for the data. The date_histogram aggregation seemed like it would be perfect for this, but I'm having a problem figuring out how to get the aggregation to make more than 5 buckets.
The index has about 725 million documents in it, spanning about 7 or 8 months so that should be several thousand buckets by hour but when I use the following query body I only get back 5 buckets
{
"query":{
"match_all":{}
},
"aggs":{
"events_per_hour":{
"date_histogram":{
"field":"timestamp",
"interval":"hour"
}
}
}
}
And the results seem to span about the right time period, but it forces it into 5 buckets instead of the several thousand I was expecting
{
"took": 276509,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 726450222,
"max_score": 0,
"hits": []
},
"aggregations": {
"events_per_hour": {
"buckets": [
{
"key_as_string": "1970-01-18T13:00:00.000Z",
"key": 1515600000,
"doc_count": 51812791
},
{
"key_as_string": "1970-01-18T14:00:00.000Z",
"key": 1519200000,
"doc_count": 130819007
},
{
"key_as_string": "1970-01-18T15:00:00.000Z",
"key": 1522800000,
"doc_count": 188046057
},
{
"key_as_string": "1970-01-18T16:00:00.000Z",
"key": 1526400000,
"doc_count": 296038311
},
{
"key_as_string": "1970-01-18T17:00:00.000Z",
"key": 1530000000,
"doc_count": 59734056
}
]
}
}
}
I tried to google for the issue, but it looks like the size parameter that you can add to terms aggregations but that's not available for the histograms apparently and I tried to change the search.max_buckets setting but that didn't work either.
Is there any way to get ES to split this data into the thousands of buckets I need? Or do I have to write something that just downloads all of the data and splits it manually in memory?
If you translate the "key_as_string" (1970-01-18T13:00:00.000) from the date to epoch, you'll see:
Epoch timestamp: 1515600
Timestamp in milliseconds: 1515600000
And if you translate 1515600000 in epoch to date you'll receive right date (Wednesday, January 10, 2018 4:00:00 PM)
So, look like you send the epoch, but in the date format of the field defined milliseconds.
Running this search in kibana:
GET myindex/mytype/_search
{"size": 0,
"aggs": {
"duplicateCount": {
"terms": {
"field": "myfield",
"min_doc_count": 2
}
}
}
}
Gives:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 12,
"successful": 12,
"failed": 0
},
"hits": {
"total": 46117,
"max_score": 0,
"hits": []
},
"aggregations": {
"duplicateCount": {
"doc_count_error_upper_bound": 12,
"sum_other_doc_count": 45817,
"buckets": []
}
}
}
I'm not sure how to interpret this result. Per the term aggregation documentation sum_other_doc_count means:
when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response
Since there are no buckets in the response, it seems odd that there are apparently buckets that were not included. Does sum_other_doc_count include the buckets that were excluded by min_doc_count, and therefore the result can be interpreted as meaning there were no documents with duplicates for myfield?
If the latter is the case, assuming there were buckets returned, is is possible to get a some_other_doc_count that does not include the min_doc_count excluded buckets, or a total count of the buckets?
Update:
It seems that I can get some of the information I want via a cardinality aggregation. Total records - field cardinality provides the approximate number of documents with a duplicate field.
Here is what i have in one of my columns
so, all of these values add up to 1.272. Now i tried to create a metric visualization for it but i get
why is it 0? The field is of type number in the index.
Update
So i tried to run this in sense
post indexName/_search
{
"size": 0,
"aggs": {
"sum block": {
"sum": {
"field": "blockSize"
}
}
}
}
}
}
and i get
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 12,
"max_score": 0,
"hits": []
},
"aggregations": {
"sum block": {
"value": 0
}
}
}
why is this happening? Should it not add up the float values? also, in the index mapping
"blockSize": {
"type": "long"
}
shouldn't this be float or double? and if it is long, then why does it store a decimal point with the values?
Probably that the first document that was indexed had blockSize: 0 and thus the long type was chosen by ES to map that field. Now, float values are stored but 0 is indexed (since it's a long).
You need to wipe your index, correct the mapping and re-index your data.
I wonder if it's possible to convert this sql query into ES query?
select top 10 app, cat, count(*) from err group by app, cat
Or in English it would be answering: "Show top app, cat and their counts", so this will be grouping by multiple fields and returning name and count.
For aggregating on a combination of multiple fields, you have to use scripting in Terms Aggregation like below:
POST <index name>/<type name>/_search?search_type=count
{
"aggs": {
"app_cat": {
"terms": {
"script" : "doc['app'].value + '#' + doc['cat'].value",
"size": 10
}
}
}
}
I am using # as a delimiter assuming that it is not present in any value of app and/or cat fields. You can use any other delimiter of your choice. You'll get a response something like below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"app_cat": {
"buckets": [
{
"key": "app2#cat2",
"doc_count": 4
},
{
"key": "app1#cat1",
"doc_count": 3
},
{
"key": "app2#cat1",
"doc_count": 2
},
{
"key": "app1#cat2",
"doc_count": 1
}
]
}
}
}
On the client side, you can get the individual values of app and cat fields from the aggregation response by string manipulations.
In newer versions of Elasticsearch, scripting is disabled by default due to security reasons. If you want to enable scripting, read this.
Terms aggregation is what you are looking for.