How to get all unique tags from 2 collections in Elasticsearch? - elasticsearch

I have a set of tags stored in document.tags and document.fields.articleTags.
This is how I get all the tags from both namespaces, but how can I get the result merged into one array in the response from ES?
{
"query" : {
"match_all" : { }
},
"size": 0,
"aggs" : {
"tags" : {
"terms" : { "field" : "tags" }
},
"articleTags" : {
"terms" : { "field" : "articleTags" }
}
}
}
Result
I get the tags listed in articleTags.buckets and tags.buckets. Is it possible to have the result delivered in one bucket?
{
"aggregations": {
"articleTags": {
"buckets": [
{
"key": "halloween"
}
]
},
"tags": {
"buckets": [
{
"key": "news"
}

Yes, you can using a single terms aggregation with a script instead that would "join" the two arrays (i.e. add them together), it goes like this:
{
"query" : {
"match_all" : { }
},
"size": 0,
"aggs" : {
"all_tags" : {
"terms" : { "script" : "doc.tags.values + doc.articleTags.values" }
}
}
}
Note that you need to make sure to enable dynamic scripting in order for this query to work.

Related

How can I aggregate the whole field value in Elasticsearch

I am using Elasticsearch 7.15 and need to aggregate a field and sort them by order.
My document saved in Elasticsearch looks like:
{
"logGroup" : "/aws/lambda/myLambda1",
...
},
{
"logGroup" : "/aws/lambda/myLambda2",
...
}
I need to find out which logGroup has the most document. In order to do that, I tried to use aggregate in Elasticsearch:
GET /my-index/_search?size=0
{
"aggs": {
"types_count": {
"terms": {
"field": "logGroup",
"size": 10000
}
}
}
}
the output of this query looks like:
"aggregations" : {
"types_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "aws",
"doc_count" : 26303620
},
{
"key" : "lambda",
"doc_count" : 25554470
},
{
"key" : "myLambda1",
"doc_count" : 25279201
}
...
}
As you can see from above output, it splits the logGroup value into terms and aggregate based on term not the whole string. Is there a way for me to aggregate them as a whole string?
I expect the output looks like:
"buckets" : [
{
"key" : "/aws/lambda/myLambda1",
"doc_count" : 26303620
},
{
"key" : "/aws/lambda/myLambda2",
"doc_count" : 25554470
},
The logGroup field in the index mapping is:
"logGroup" : {
"type" : "text",
"fielddata" : true
},
Can I achieve it without updating the index?
In order to get what you expect you need to change your mapping to this:
"logGroup" : {
"type" : "keyword"
},
Failing to do that, your log groups will get analyzed by the standard analyzer which splits the whole string and you'll not be able to aggregate by full log groups.
If you don't want or can't change the mapping and reindex everything, what you can do is the following:
First, add a keyword sub-field to your mapping, like this:
PUT /my-index/_mapping
{
"properties": {
"logGroup" : {
"type" : "text",
"fields": {
"keyword": {
"type" : "keyword"
}
}
}
}
}
And then run the following so that all existing documents pick up this new field:
POST my-index/_update_by_query?wait_for_completion=false
Finally, you'll be able to achieve what you want with the following query:
GET /my-index/_search
{
"size": 0,
"aggs": {
"types_count": {
"terms": {
"field": "logGroup.keyword",
"size": 10000
}
}
}
}

ElasticSearch: Sort Aggregations by Filtered Average

I have an ElasticSearch index with documents structured like this:
"created": "2019-07-31T22:44:41.437Z",
"id": "2956",
"rating": 1
If I wish to create an aggregation of the id fields which is sorted on the average of the rating, that could be handled by:
{
"aggs" : {
"sorted" : {
"terms" : {
"field" : "id",
"order" : { "sort" : "asc" }
},
"aggs" : {
"sort" : {
"avg" : {
"field" : "rating"
}
}
}
}
}
}
However, I'm looking to only factor in documents which have a created value that was within the last week (and then take the average of those rating fields).
My naive thoughts on this would be to apply a filter or range within the sort aggregation, but an aggregation cannot have multiple types, and looking through the avg documentation, I don't see a means to put it in the avg. Optimistically attempting to put range fields in the avg regardless of what the documentation says yielded no results (as expected).
How would I go about achieving this?
Try adding a bool query to the body with a range query:
{
query:
bool: {
must: {
"range": {
"created_time": {
"gte": one_week_ago,
}
}
}
}
},
{
"aggs" : {
"sorted" : {
"terms" : {
"field" : "id",
"order" : { "sort" : "asc" }
},
"aggs" : {
"sort" : {
"avg" : {
"field" : "rating"
}
}
}
}
}
}
and you can query for dynamic dates like this
as Tom referred but use "now-7d/d"
{
query:
bool: {
must: {
"range": {
"created_time": {
"gte": "now-7d/d"
}
}
}
}
}

Converting SQL query to ElasticSearch Query

I want to convert the following sql query to Elasticsearch one. can any one help in this.
select csgg, sum(amount) from table1
where type in ('a','b','c') and year=2016 and fc="33" group by csgg having sum(amount)=0
I tried following way:enter code here
{
"size": 500,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{"term" : {"fc" : "33"}},
{"term" : {"year" : 2016}}
],
"should" : [
{"terms" : {"type" : ["a","b","c"] }}
]
}
}
}
},
"aggs": {
"group_by_csgg": {
"terms": {
"field": "csgg"
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
}
}
}
}
}
but not sure if I am doing right as its not validating the results.
seems query to be added inside aggregation.
Assuming that you use Elasticsearch 2.x, there is a possibility to have the having-semantics in Elasticsearch.
I'm not aware of a possibility prior 2.0.
You can use the new Pipeline Aggregation Bucket Selector Aggregation, which only selects the buckets, which meet a certain criteria:
POST test/test/_search
{
"size": 0,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{"term" : {"fc" : "33"}},
{"term" : {"year" : 2016}},
{"terms" : {"type" : ["a","b","c"] }}
]
}
}
}
},
"aggs": {
"group_by_csgg": {
"terms": {
"field": "csgg",
"size": 100
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
},
"no_amount_filter": {
"bucket_selector": {
"buckets_path": {"sumAmount": "sum_amount"},
"script": "sumAmount == 0"
}
}
}
}
}
}
However there are two caveats. Depending on your configuration, it might be necessary to enable scripting like that:
script.aggs: true
script.groovy: true
Moreover, as it works on the parent buckets it is not guaranteed that you get all buckets with amount = 0. If the terms aggregation selects only terms with sum amount != 0, you will have no result.

Elasticsearch, how to return unique values of two fields

I have an index with 20 different fields. I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
In SQL it would look this way: select unique cat, sub from table A;
I can do it for one field this way:
{
"size": 0,
"aggs" : {
"unique_set" : {
"terms" : { "field" : "cat" }
}
}}
but how do I add another field to check uniqueness across two fields?
Thanks,
SQL's SELECT DISTINCT [cat], [sub] can be imitated with a Composite Aggregation.
{
"size": 0,
"aggs": {
"cat_sub": {
"composite": {
"sources": [
{ "cat": { "terms": { "field": "cat" } } },
{ "sub": { "terms": { "field": "sub" } } }
]
}
}
}
}
Returns...
"buckets" : [
{
"key" : {
"cat" : "a",
"sub" : "x"
},
"doc_count" : 1
},
{
"key" : {
"cat" : "a",
"sub" : "y"
},
"doc_count" : 2
},
{
"key" : {
"cat" : "b",
"sub" : "y"
},
"doc_count" : 3
}
]
The only way to solve this are probably nested aggregations:
{
"size": 0,
"aggs" : {
"unique_set_1" : {
"terms" : {
"field" : "cats"
},
"aggregations" : {
"unique_set_2": {
"terms": {"field": "sub"}
}
}
}
}
}
Quote:
I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
This is nonsense; your question is unclear. You can have 10s unique pairs {cat, sub}, and 100s unique triplets {cat, sub, field_3}, and 1000s unique documents Doc{cat, sub, field3, field4, ...}.
If you are interested in document counts per unique pair {"Category X", "Subcategory Y"} then you can use Cardinality aggregations. For two or more fields you will need to use scripting which will come with performance hit.
Example:
{
"aggs" : {
"multi_field_cardinality" : {
"cardinality" : {
"script": "doc['cats'].value + ' _my_custom_separator_ ' + doc['sub'].value"
}
}
}
}
Alternate solution: use nested Terms terms aggregations.

Post filter on subaggregation in elasticsearch

I am trying to run a post filter on the aggregated data, but it is not working as i expected. Can someone review my query and suggest if i am doing anything wrong here.
"query" : {
"bool" : {
"must" : {
"range" : {
"versionDate" : {
"from" : null,
"to" : "2016-04-22T23:13:50.000Z",
"include_lower" : false,
"include_upper" : true
}
}
}
}
},
"aggregations" : {
"associations" : {
"terms" : {
"field" : "association.id",
"size" : 0,
"order" : {
"_term" : "asc"
}
},
"aggregations" : {
"top" : {
"top_hits" : {
"from" : 0,
"size" : 1,
"_source" : {
"includes" : [ ],
"excludes" : [ ]
},
"sort" : [ {
"versionDate" : {
"order" : "desc"
}
} ]
}
},
"disabledDate" : {
"filter" : {
"missing" : {
"field" : "disabledDate"
}
}
}
}
}
}
}
STEPS in the query:
Filter by indexDate less than or equal to a given date.
Aggregate based on formId. Forming buckets per formId.
Sort in descending order and return top hit result per bucket.
Run a subaggregation filter after the sort subaggregation and remove all the documents from buckets where disabled date is not null.(Which is not working)
The whole purpose of post_filter is to run after aggregations have been computed. As such, post_filter has no effect whatsoever on aggregation results.
What you can do in your case is to apply a top-level filter aggregation so that documents with no disabledDate are not taken into account in aggregations, i.e. consider only documents with disabledDate.
{
"query": {
"bool": {
"must": {
"range": {
"versionDate": {
"from": null,
"to": "2016-04-22T23:13:50.000Z",
"include_lower": true,
"include_upper": true
}
}
}
}
},
"aggregations": {
"with_disabled": {
"filter": {
"exists": {
"field": "disabledDate"
}
},
"aggs": {
"form.id": {
"terms": {
"field": "form.id",
"size": 0
},
"aggregations": {
"top": {
"top_hits": {
"size": 1,
"_source": {
"includes": [],
"excludes": []
},
"sort": [
{
"versionDate": {
"order": "desc"
}
}
]
}
}
}
}
}
}
}
}

Resources