I am trying to figure something out :
Here's an example of a document that contains object properties, and then trying to do simple terms aggregations.
https://gist.github.com/BAmine/80e1be219d2ac272561a
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"test": {
"buckets": [
{
"key": "canine",
"doc_count": 1,
"test2": {
"buckets": [
{
"key": "cat",
"doc_count": 1
},
{
"key": "dog",
"doc_count": 1
},
{
"key": "tiger",
"doc_count": 1
},
{
"key": "wolf",
"doc_count": 1
}
]
}
},
{
"key": "feline",
"doc_count": 1,
"test2": {
"buckets": [
{
"key": "cat",
"doc_count": 1
},
{
"key": "dog",
"doc_count": 1
},
{
"key": "tiger",
"doc_count": 1
},
{
"key": "wolf",
"doc_count": 1
}
]
}
}
]
}
}
}
The question is : How can I avoid getting, in my sub-aggregations, buckets whose keys do not belong to the parent aggregation's keys ( example : cat and tiger are not in the property whose label is canine) ?
Is there a way to do this without using nested properties ?
Thank you !
To have this work with the data as is; you could set the animals field's type to nested:
"animals":{
"type": "nested",
"properties": {
"label" : { "type" : "string"},
"names":{
"properties":{
"label" : {"type" : "string"}
}
}
}
This allows you to make requests of that part of the document as separate objects. You could then use two filter aggregations within nested aggregations, one filtering for label == feline and the other for label == canine, you could then use aggregations within these that would give you the two separate lists.
This solution would have the drawback of having to add another nested filter aggregation for each new class of animals you add later.
the solution #vadik suggested seems superior to me, as there doesn't seem to be anything about these lists that requires them to be in the same document. If there is, you could make them be in separate documents with a common parent.
The problem is that you have one document for both the animals. And that's why you will get all the four animals. I suggest another approach instead. Create 2 documents, 1 each for every item in the animals array.
And try on similar lines and you will get your result.
Why are you not getting the result?
Aggregations framework gets the only document, and finds the occurrences of animals.label. It finds two, canine and feline and it outputs both. Further, there is another aggregation within the previous aggregation that wants to aggregate the key animals.names.label. Now there is just one document, which has both the keys of type animals.label and then for each key, the document has all the four values for key animals.names.label. So ES is right. The problem is that the item in animals must be independently identifiable as a document. And then aggs framework will be able to consider it as a container and your intention to aggregate animals.names.label inside animals.label. This is exactly what will happen when you will split the document into two documents.
Another thing that you can try is working with Nested Types. To understand why nested types may help, read this article.
Related
I have stored some values in Elasticsearch nested data type (an array) but without using key/value pair. An example record would be:
{
"categories": [
"Category1",
"Category2"
],
"product_name": "productx"
}
Now I want to run aggregation query to find out unique list of categories available. But all the examples I've seen pointed to mapping that has key/value. Is there any way I can use above schema as is or do I need to change my schema to something like this to run aggregation query
{
"categories": [
{"name": "Category1"},
{"name": "Category2"}
],
"product_name": "productx"
}
Well regarding JSON structure, you need to take a step back and figure out if you'd want list or key-value pairs.
Looking at your example, I don't think you need key-value pairs but again its something you may want to clarify by understanding your domain if there'd be some more properties for categories.
Regarding aggregation, as far as I know, aggregations would work on any valid JSON structure.
For the data you've mentioned, you can make use of the below aggregation query. Also I'm assuming the fields are of type keyword.
Aggregation Query
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"myaggs": {
"terms": {
"size": 100,
"script": {
"inline": """
def myString = "";
def list = new ArrayList();
for(int i=0; i<doc['categories'].length; i++){
myString = doc['categories'][i] + ", " + doc['product'].value;
list.add(myString);
}
return list;
"""
}
}
}
}
}
Aggregation Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category1, productx",
"doc_count": 1
},
{
"key": "category2, productx",
"doc_count": 1
}
]
}
}
}
Hope it helps!
I recently started working on ElasticSearch, and I am trying search for following criteria
I want to apply exact match on ENAME & distinct on both EID & ENAME on above data.
Let say for matching, I have string ABC.
So result should be like as below
[
{"EID" :111, "ENAME" : "ABC"},
{"EID" : 444, "ENAME" : "ABC"}
]
You can achieve this via a combination of term query and terms aggregation.
Assuming that you have the following mapping:
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"EID": {
"type": "keyword"
},
"ENAME": {
"type": "keyword"
}
}
}
}
}
And inserted the documents like this:
POST my_index/doc/3
{
"EID": "111",
"ENAME": "ABC"
}
POST my_index/doc/4
{
"EID": "222",
"ENAME": "XYZ"
}
POST my_index/doc/12
{
"EID": "444",
"ENAME": "ABC"
}
The query that will do the job might look like this:
POST my_index/doc/_search
{
"query": {
"term": { 1️⃣
"ENAME": "ABC"
}
},
"size": 0, 3️⃣
"aggregations": {
"by EID": {
"terms": { 2️⃣
"field": "EID"
}
}
}
}
Let me explain how it works:
1️⃣ - term query asks Elasticsearch to filter on exact value of a keyword field "ENAME";
2️⃣ - terms aggregation collects the list of all possible values of another keyword field "EID" and gives back the first N most frequent ones;
3️⃣ - "size": 0 tells Elasticsearch not to return any search hits (we are only interested in the aggregations).
The output of the query will look like this:
{
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"by EID": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "111", <== Here is the first "distinct" value that we wanted
"doc_count": 3
},
{
"key": "444", <== Here is another "distinct" value
"doc_count": 2
}
]
}
}
}
The output does not look exactly like what you posted in the question, but I believe it is the closest what you can achieve with Elasticsearch.
However, this output is equivalent:
"ENAME" is implicitly present (since its value was used for filtering)
"EID" is present under the "buckets" of the aggregations section.
Note that under "doc_count" you will find the number of documents having such "EID".
What if I want to do a DISTINCT on several fields?
For a more complex scenario (e.g. when you need to do a distinct on many fields) see this answer.
More information about aggregations is available here.
Hope that helps!
I am running an elastic search query with aggregation, which I intend to limit to say 100 records. The problem is that even when I apply the "size" filter, there is no effect on the aggregation.
GET /index_name/index_type/_search
{
"size":0,
"query":{
"match_all": {}
},
"aggregations":{
"courier_code" : {
"terms" : {
"field" : "city"
}
}
}}
The result set is
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"failed": 0
},
"hits": {
"total": 10867,
"max_score": 0,
"hits": []
},
"aggregations": {
"city": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Mumbai",
"doc_count": 2706
},
{
"key": "London",
"doc_count": 2700
},
{
"key": "Patna",
"doc_count": 1800
},
{
"key": "New York",
"doc_count": 1800
},
{
"key": "Melbourne",
"doc_count": 900
}
]
}
}
}
As you can see there is no effect on limiting the records on which the aggregation is to be performed. Is there a filter for, say top 100 records in Elastic Search.
Search operations in elasticsearch are performed in two phases query and fetch. During the first phase elasticsearch obtains results from all shards sorts them and determines which records should be returned. These records are then retrieved during the second phase. The size parameter controls the number of records that are returned to you in the response. Aggregations are executed during the first phase before elasticsearch actually knows which records will need to be retrieved and they are always executed on all records in the search. So, it's not possible to limit it by the total number of results. If you want to limit the scope of aggregation execution you need to limit the search query instead changing retrieval parameter. For example, if you add a filter to your search query that will only include records from the last year, aggregations will be executed on this filter.
It's also possible to limit the number of records that are analyzed on each shard using terminate_after parameter, however you will have no control on which records will be included and which records wouldn't be included into results, so this option is most likely not what you want.
I wonder if it's possible to convert this sql query into ES query?
select top 10 app, cat, count(*) from err group by app, cat
Or in English it would be answering: "Show top app, cat and their counts", so this will be grouping by multiple fields and returning name and count.
For aggregating on a combination of multiple fields, you have to use scripting in Terms Aggregation like below:
POST <index name>/<type name>/_search?search_type=count
{
"aggs": {
"app_cat": {
"terms": {
"script" : "doc['app'].value + '#' + doc['cat'].value",
"size": 10
}
}
}
}
I am using # as a delimiter assuming that it is not present in any value of app and/or cat fields. You can use any other delimiter of your choice. You'll get a response something like below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"app_cat": {
"buckets": [
{
"key": "app2#cat2",
"doc_count": 4
},
{
"key": "app1#cat1",
"doc_count": 3
},
{
"key": "app2#cat1",
"doc_count": 2
},
{
"key": "app1#cat2",
"doc_count": 1
}
]
}
}
}
On the client side, you can get the individual values of app and cat fields from the aggregation response by string manipulations.
In newer versions of Elasticsearch, scripting is disabled by default due to security reasons. If you want to enable scripting, read this.
Terms aggregation is what you are looking for.
For some of my queries to ElasticSearch I want three pieces of information back:
Which terms T occurred in the result document set?
How often does each element of T occur in the result document set?
How often does each element of T occur in the entire index (--> document frequency)?
The first points are easily determined using the default term facet or, nowadays, by the term aggregation method.
So my question is really about the third point.
Before ElasticSearch 1.x, i.e. before the switch to the 'aggregation' paradigm, I could use a term facet with the 'global' option set to true and a QueryFilter to get the document frequency ('global counts') of the exact terms occurring in the document set specified by the QueryFilter.
At first I thought I could do the same thing using a global aggregation, but it seems I can't. The reason is - if I understand correctly - that the original facet mechanism were centered around terms whereas the aggregation buckets are defined by the the set of documents belonging to each bucket.
I.e. specifying the global option of a term facet with a QueryFilter first determined the terms hit by the filter and then computed facet values. Since the facet was global I would receive the document counts.
With aggregations, it's different. The global aggregation can only be used as a top aggregation, causing the aggregation to ignore the current query results and compute the aggregation - e.g. a terms aggregation - on all documents in the index. So for me, that's too much, since I WANT to restrict the returned terms ('buckets') to the terms in the document result set. But if I use a filter-sub-aggregation with a terms-sub-aggregation, I would restrict the term-buckets to the filter again, thus not retrieving the document frequencies but normal facet counts. The reason is that the buckets are determined after the filter so they are "too small". But I don't want restrict bucket size, I want to restrict the buckets to the terms in the query result set.
How can I get the document frequency of those terms in a query result set using aggregations (since facets are deprecated and will be removed)?
Thanks for your time!
EDIT: Here comes an example of how I tried to achieve the desired behaviour.
I will define two aggregations:
global_agg_with_filter_and_terms
global_agg_with_terms_and_filter
Both have a global aggregation at their tops because its the only valid position for it. Then, in the first aggregation, I first filter the results to the original query and then apply a term-sub-aggregation.
In the second aggregation, I do mostly the same, only that here the filter aggregation is a sub-aggregation of the terms aggregation. Hence the similar names, only the order of aggregation differs.
{
"query": {
"query_string": {
"query": "text: my query string"
}
},
"aggs": {
"global_agg_with_filter_and_terms": {
"global": {},
"aggs": {
"filter_agg": {
"filter": {
"query": {
"query_string": {
"query": "text: my query string"
}
}
},
"aggs": {
"terms_agg": {
"terms": {
"field": "facets"
}
}
}
}
}
},
"global_agg_with_terms_and_filter": {
"global": {},
"aggs": {
"document_frequency": {
"terms": {
"field": "facets"
},
"aggs": {
"term_count": {
"filter": {
"query": {
"query_string": {
"query": "text: my query string"
}
}
}
}
}
}
}
}
}
}
Response:
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 221,
"max_score": 0.9839197,
"hits": <omitted>
},
"aggregations": {
"global_agg_with_filter_and_terms": {
"doc_count": 1978,
"filter_agg": {
"doc_count": 221,
"terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "fid8",
"doc_count": 155
},
{
"key": "fid6",
"doc_count": 40
},
{
"key": "fid9",
"doc_count": 10
},
{
"key": "fid5",
"doc_count": 9
},
{
"key": "fid13",
"doc_count": 5
},
{
"key": "fid7",
"doc_count": 2
}
]
}
}
},
"global_agg_with_terms_and_filter": {
"doc_count": 1978,
"document_frequency": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "fid8",
"doc_count": 1050,
"term_count": {
"doc_count": 155
}
},
{
"key": "fid6",
"doc_count": 668,
"term_count": {
"doc_count": 40
}
},
{
"key": "fid9",
"doc_count": 67,
"term_count": {
"doc_count": 10
}
},
{
"key": "fid5",
"doc_count": 65,
"term_count": {
"doc_count": 9
}
},
{
"key": "fid7",
"doc_count": 63,
"term_count": {
"doc_count": 2
}
},
{
"key": "fid13",
"doc_count": 55,
"term_count": {
"doc_count": 5
}
},
{
"key": "fid10",
"doc_count": 11,
"term_count": {
"doc_count": 0
}
},
{
"key": "fid11",
"doc_count": 9,
"term_count": {
"doc_count": 0
}
},
{
"key": "fid12",
"doc_count": 5,
"term_count": {
"doc_count": 0
}
}
]
}
}
}
}
At first, please have a look at the first two returned term-buckets of both aggregations, with keys fid8 and fid6. We can easily see that those terms have been appearing in the result set 155 and 40 times, respectively. Now please look at the second aggregation, global_agg_with_terms_and_filter. The terms-aggregation is within the scope of the global aggregation, so here we can actually see the document frequencies, 1050 and 668, respectively. So this part looks good. The issue arises when you scan the list of term buckets further down, to the buckets with the keys fid10 to fid12. While we receive their document frequency, we can also see that their term_count is 0. This is due to the fact that those terms did not occur in our query, that we also used for the filter-sub-aggregation. So the problem is that for ALL terms (global scope!) their document frequency and their facet count with regards to the actual query result is returned. But I need this to be made exactly for the terms that occurred in the query result, i.e. for those exact terms returned by the first aggregation global_agg_with_filter_and_terms.
Perhaps there is a possibity to define some kind of filter that removes all buckets where their sub-filter-aggregation term_count has a zero doc_count?
Hello and sorry if the answer is late.
You should have a look at the Significant Terms aggregation as, like the terms aggregation, it returns one bucket for each term occuring in the results set with the number of occurences available through doc_count, but you also get the number of occurrences in a background set through bg_count. This means it only creates buckets for terms appearing in documents of your query results set.
The default background set comprises all documents in the query scope, but can be filtered down to any subset you want using background_filter.
You can use a scripted bucket scoring function to rank the buckets the way you want by combining several metrics:
_subset_freq: number of documents the term appears in the results set,
_superset_freq: number of documents the term appears in the background set,
_subset_size: number of documents in the results set,
_superset_size: number of documents in the background set.
Request:
{
"query": {
"query_string": {
"query": "text: my query string"
}
},
"aggs": {
"terms": {
"significant_terms": {
"script": "_subset_freq",
"size": 100
}
}
}
}