Elasticsearch: how to find number of occurrences

I wonder if it's possible to convert this SQL query into an ES query:
select top 10 app, cat, count(*) from err group by app, cat
Or, in English: "Show the top app/cat combinations and their counts", so this is grouping by multiple fields and returning each name with its count.

For aggregating on a combination of multiple fields, you have to use scripting in a terms aggregation, like below:
POST <index name>/<type name>/_search?search_type=count
{
"aggs": {
"app_cat": {
"terms": {
"script" : "doc['app'].value + '#' + doc['cat'].value",
"size": 10
}
}
}
}
I am using # as a delimiter, assuming it is not present in any value of the app or cat fields; you can use any other delimiter of your choice. You'll get a response something like the one below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"app_cat": {
"buckets": [
{
"key": "app2#cat2",
"doc_count": 4
},
{
"key": "app1#cat1",
"doc_count": 3
},
{
"key": "app2#cat1",
"doc_count": 2
},
{
"key": "app1#cat2",
"doc_count": 1
}
]
}
}
}
On the client side, you can recover the individual app and cat values from each bucket key by splitting on the delimiter.
Depending on your Elasticsearch version, dynamic scripting may be disabled by default for security reasons. If you want to enable scripting, read this.
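If scripting is not available, you can get the same counts without a script by nesting one terms aggregation inside another. A sketch under the same assumptions (app and cat are not-analyzed/keyword fields); each inner bucket's doc_count is the count for that app/cat combination, but ranking the overall top 10 pairs then has to be done on the client by flattening the two bucket levels:
POST <index name>/<type name>/_search?search_type=count
{
  "aggs": {
    "by_app": {
      "terms": {
        "field": "app",
        "size": 10
      },
      "aggs": {
        "by_cat": {
          "terms": {
            "field": "cat",
            "size": 10
          }
        }
      }
    }
  }
}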

Terms aggregation is what you are looking for.

Related

Elasticsearch aggregation on values in nested list (array)

I have stored some values in an Elasticsearch nested data type (an array) but without using key/value pairs. An example record would be:
{
"categories": [
"Category1",
"Category2"
],
"product_name": "productx"
}
Now I want to run an aggregation query to find the unique list of categories available. But all the examples I've seen point to mappings that use key/value pairs. Is there any way I can use the above schema as is, or do I need to change my schema to something like this to run the aggregation query?
{
"categories": [
{"name": "Category1"},
{"name": "Category2"}
],
"product_name": "productx"
}
Regarding the JSON structure, you need to take a step back and figure out whether you want a plain list or key/value pairs.
Looking at your example, I don't think you need key/value pairs, but that is something to decide from your domain, for example whether categories will ever carry more properties.
Regarding aggregation, as far as I know, aggregations work on any valid JSON structure.
For the data you've mentioned, you can use the aggregation query below. I'm assuming the fields are of type keyword.
Aggregation Query
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"myaggs": {
"terms": {
"size": 100,
"script": {
"inline": """
// build one "category, product_name" string per category value
def list = new ArrayList();
for (int i = 0; i < doc['categories'].length; i++) {
list.add(doc['categories'][i] + ", " + doc['product_name'].value);
}
return list;
"""
}
}
}
}
}
Aggregation Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category1, productx",
"doc_count": 1
},
{
"key": "category2, productx",
"doc_count": 1
}
]
}
}
}
Hope it helps!
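If all you actually need is the unique list of categories, without pairing each one with the product name, a plain terms aggregation on the array field is enough and needs no script. A minimal sketch, assuming categories is mapped as keyword; a terms aggregation over an array field simply produces one bucket per distinct value across all matching documents:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "unique_categories": {
      "terms": {
        "field": "categories",
        "size": 100
      }
    }
  }
}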

Why does Elasticsearch only return 5 buckets for date_histogram aggregations?

I have an Elasticsearch index full of legacy log data that I want to bucket by hour, to get an idea of when the most active times were for the data. The date_histogram aggregation seemed like it would be perfect for this, but I'm having trouble getting the aggregation to produce more than 5 buckets.
The index has about 725 million documents in it, spanning about 7 or 8 months, so that should be several thousand hourly buckets, but when I use the following query body I only get back 5 buckets:
{
"query":{
"match_all":{}
},
"aggs":{
"events_per_hour":{
"date_histogram":{
"field":"timestamp",
"interval":"hour"
}
}
}
}
The results seem to span about the right time period, but they are forced into 5 buckets instead of the several thousand I was expecting:
{
"took": 276509,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 726450222,
"max_score": 0,
"hits": []
},
"aggregations": {
"events_per_hour": {
"buckets": [
{
"key_as_string": "1970-01-18T13:00:00.000Z",
"key": 1515600000,
"doc_count": 51812791
},
{
"key_as_string": "1970-01-18T14:00:00.000Z",
"key": 1519200000,
"doc_count": 130819007
},
{
"key_as_string": "1970-01-18T15:00:00.000Z",
"key": 1522800000,
"doc_count": 188046057
},
{
"key_as_string": "1970-01-18T16:00:00.000Z",
"key": 1526400000,
"doc_count": 296038311
},
{
"key_as_string": "1970-01-18T17:00:00.000Z",
"key": 1530000000,
"doc_count": 59734056
}
]
}
}
}
I tried to google the issue; the size parameter that you can add to terms aggregations apparently isn't available for histograms, and changing the search.max_buckets setting didn't work either.
Is there any way to get ES to split this data into the thousands of buckets I need? Or do I have to write something that just downloads all of the data and splits it manually in memory?
If you convert the "key_as_string" (1970-01-18T13:00:00.000Z) back to an epoch value, you'll see:
Epoch timestamp (seconds): 1515600
Timestamp in milliseconds: 1515600000
And if you interpret 1515600000 as epoch seconds instead, you get a sensible date (Wednesday, January 10, 2018 4:00:00 PM).
So it looks like you indexed epoch timestamps in seconds, but the field's date format is defined in milliseconds, which collapses the whole dataset into a few "hours" in January 1970.
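One way to fix this, assuming the stored values really are epoch seconds, is to reindex into an index whose date mapping accepts epoch_second, so that date_histogram can bucket the data at its real resolution. A sketch of such a mapping (the index name events_fixed is a placeholder, and older Elasticsearch versions need a type name inside mappings):
PUT events_fixed
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "epoch_second"
      }
    }
  }
}
After reindexing (for example with the _reindex API), the original date_histogram query should return one bucket per hour of actual data.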

Elasticsearch: limiting the records that are aggregated

I am running an Elasticsearch query with an aggregation, which I intend to limit to, say, 100 records. The problem is that even when I apply the "size" parameter, it has no effect on the aggregation.
GET /index_name/index_type/_search
{
"size":0,
"query":{
"match_all": {}
},
"aggregations":{
"courier_code" : {
"terms" : {
"field" : "city"
}
}
}}
The result set is
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"failed": 0
},
"hits": {
"total": 10867,
"max_score": 0,
"hits": []
},
"aggregations": {
"city": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Mumbai",
"doc_count": 2706
},
{
"key": "London",
"doc_count": 2700
},
{
"key": "Patna",
"doc_count": 1800
},
{
"key": "New York",
"doc_count": 1800
},
{
"key": "Melbourne",
"doc_count": 900
}
]
}
}
}
As you can see, there is no effect on limiting the records on which the aggregation is performed. Is there a way to aggregate over only, say, the top 100 records in Elasticsearch?
Search operations in Elasticsearch are performed in two phases: query and fetch. During the first phase, Elasticsearch obtains results from all shards, sorts them, and determines which records should be returned. Those records are then retrieved during the second phase. The size parameter only controls the number of records returned to you in the response. Aggregations are executed during the first phase, before Elasticsearch actually knows which records will be retrieved, and they always run over all records that match the search. So it's not possible to limit them by the total number of results. If you want to limit the scope of aggregation execution, you need to restrict the search query itself rather than the retrieval parameters. For example, if you add a filter that only includes records from the last year, the aggregations will be executed only on those records.
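A sketch of that approach, using a range filter on a date field (created_at is a hypothetical field name here); the terms aggregation then only sees the documents that match the query:
GET /index_name/index_type/_search
{
  "size": 0,
  "query": {
    "range": {
      "created_at": {
        "gte": "now-1y"
      }
    }
  },
  "aggregations": {
    "city": {
      "terms": {
        "field": "city"
      }
    }
  }
}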
It's also possible to limit the number of records analyzed on each shard using the terminate_after parameter; however, you will have no control over which records are included in the results and which aren't, so this option is most likely not what you want.
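For completeness, a sketch of the terminate_after variant: each shard stops collecting after the given number of documents, so the bucket counts become partial and essentially arbitrary:
GET /index_name/index_type/_search
{
  "size": 0,
  "terminate_after": 100,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "city": {
      "terms": {
        "field": "city"
      }
    }
  }
}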

Range aggregation on multiple fields

I have two fields in the search document, salary_from and salary_to, and I want an aggregation of salary ranges such as 0-10, 10-20, etc.
Is there any way to set multiple fields on the Elasticsearch range aggregation? (I can set one field by using the setField function.)
I just need to get the aggregated count of salary ranges or slabs, considering the two fields salary_from and salary_to.
Please help me.
If I understand your question correctly, below is what you need.
{
"size": 0,
"aggs": {
"salary_ranges": {
"terms": {
"script": "doc['salary_from'].value + ' to ' + doc['salary_to'].value",
"size": 0
}
}
}
}
It basically uses a script for Terms Aggregation. Read more about it here.
If, say, you have 3 documents with salary_from set to 3 and salary_to set to 5, and 2 documents with salary_from set to 10 and salary_to set to 25, the result of the query above will look something like this:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0,
"hits": []
},
"aggregations": {
"salary_ranges": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "3 to 5",
"doc_count": 3
},
{
"key": "10 to 25",
"doc_count": 2
}
]
}
}
}
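If the slabs should follow fixed boundaries (0-10, 10-20, and so on) rather than the literal from/to pairs, a range aggregation is closer to what was asked. A sketch that assumes the slab is determined by salary_from alone, which is an assumption you may need to adjust:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "salary_slabs": {
      "range": {
        "field": "salary_from",
        "ranges": [
          { "to": 10 },
          { "from": 10, "to": 20 },
          { "from": 20 }
        ]
      }
    }
  }
}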

Elasticsearch highlighting - not working

I have the following problem:
I'm doing some tests with faceting.
My script is as follows:
https://gist.github.com/nayelisantacruz/6610862
the result I get is as follows:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": []
},
"facets": {
"title": {
"_type": "terms",
"missing": 0,
"total": 2,
"other": 0,
"terms": [
{
"term": "JavaScript",
"count": 1
},
{
"term": "Java Platform, Standard Edition",
"count": 1
}
]
}
}
}
which is fine, but the problem is that I cannot get the highlighting to display.
I was expecting a result like the following:
..........
..........
..........
"facets": {
"title": {
"_type": "terms",
"missing": 0,
"total": 2,
"other": 0,
"terms": [
{
"term": "<b>Java</b>Script",
"count": 1
},
{
"term": "<b>Java</b> Platform, Standard Edition",
"count": 1
}
]
}
}
..........
..........
..........
Can anyone help me and tell me what I'm doing wrong or what I'm missing, please?
Thank you very much for your attention.
Faceting and highlighting are two completely different things. Highlighting works together with search, in order to return highlighted snippets for each of the search results.
Faceting is a completely different story: a facet effectively looks at all the terms that have been indexed for a specific field, throughout all the documents that match the main query. In that respect, the query only controls which documents are taken into account when computing the facets. Only the top terms (by default, those with the highest counts) are returned, and those terms relate not just to the search results (10 by default) but to all the documents that match the query.
That said, the terms returned with the facets are never highlighted.
If you use highlighting, you should see in your response, as mentioned in the reference, a new section that contains the highlighted snippets for each of your search results. The reason you don't see it is that you are querying the title.autocomplete field but highlighting the title field with require_field_match enabled. You either have to set require_field_match to false or highlight the same field that you are querying on. But again, this is not related to faceting whatsoever.
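For illustration, a sketch of the second option, highlighting the same field that the query runs against (the match query and field names are assumptions based on the description above):
POST <your_index_name>/_search
{
  "query": {
    "match": {
      "title.autocomplete": "java"
    }
  },
  "highlight": {
    "fields": {
      "title.autocomplete": {}
    }
  }
}
The highlighted snippets then show up inside each entry of hits.hits, not inside the facets section.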
Note the use of * instead of _all in the highlight fields. This works like a charm at all levels of nesting:
POST 123821/Encounters/_search
{
"query": {
"query_string": {
"query": "Aller*"
}
},
"highlight": {
"fields": {
"*": {}
}
}
}
