Term aggregation on filtered array items - elasticsearch

I want to aggregate on terms that are inside an array, but I am only interested in some of the array items. I made up a simplified example. Basically, I want to aggregate on Type.string only where Type.field is valid.
POST so/question
{
"Type": [
[
{
"field": "invalid",
"string": "A"
}
],
[
{
"field": "valid",
"string": "B"
}
]
]
}
GET /so/_search
{
"size": 0,
"aggs": {
"xxx": {
"filter": {
"term": {
"Type.field": "valid"
}
},
"aggs": {
"yyy": {
"terms": {
"field": "Type.string.keyword",
"min_doc_count": 0
}
}
}
}
}
}
The aggregation result has 2 keys, whereas I only need the "B" key.
"aggregations": {
"xxx": {
"doc_count": 1,
"yyy": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "A",
"doc_count": 1
},
{
"key": "B",
"doc_count": 1
}
]
}
}
}
Is there a way to aggregate on array items which match the filter?
Unfortunately, I can't change the data format, which would be the obvious solution.

Unless the documents are of nested type, I don't think it's possible with simple array types, because of the way Elasticsearch flattens the objects and stores them.
Querying these flattened objects can give you completely unexpected results, because the link between a field value and the string value of the same array item is lost.
That said, I've come up with the query below, which uses a terms aggregation with a script. It works fine for the document you've mentioned in the question:
POST so/_search
{
"size": 0,
"aggs": {
"xxx": {
"filter": {
"term": {
"Type.field": "valid"
}
},
"aggs": {
"yyy": {
"terms": {
"script": {
"source": """
int size = doc['Type.string.keyword'].values.length;
for(int i=0; i<size; i++){
String myString = doc['Type.string.keyword'][i];
if(myString.equals("B") && doc['Type.field.keyword'][i].equals("valid")){
return myString;
}
}""",
"lang": "painless"
}
}
}
}
}
}
}
However, if you ingest the document below, you'll see that the aggregation response is completely different from what you'd expect. That is because array types don't store each Type.field value and its Type.string value at the same i-th position in their respective arrays.
POST so/question/2
{
"Type": [
[
{
"field": "valid",
"string": "A"
}
],
[
{
"field": "invalid",
"string": "B"
}
]
]
}
Notice that even the simple bool query below doesn't work as expected and ends up returning both documents.
POST so/_search
{
"query": {
"bool": {
"must": [
{ "match": { "Type.field.keyword": "valid" }},
{ "match": { "Type.string.keyword": "B" }}
]
}
}
}
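To make the flattening concrete, here is a rough Python sketch (not Elasticsearch code) of what happens to the two documents above. The flatten helper is hypothetical; it assumes that a non-nested array of objects is indexed as one multi-valued field per dotted path and that keyword doc values are kept sorted and de-duplicated.

```python
def flatten(doc):
    """Collect leaf values per dotted path, ignoring array structure."""
    fields = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            # Arrays (including arrays of arrays) add no level to the path.
            for item in node:
                walk(item, path)
        else:
            fields.setdefault(path, set()).add(node)

    walk(doc, "")
    # Assumption: doc values are stored sorted and without duplicates.
    return {path: sorted(values) for path, values in fields.items()}


doc1 = {"Type": [[{"field": "invalid", "string": "A"}],
                 [{"field": "valid", "string": "B"}]]}
doc2 = {"Type": [[{"field": "valid", "string": "A"}],
                 [{"field": "invalid", "string": "B"}]]}

flat1 = flatten(doc1)
flat2 = flatten(doc2)
print(flat1)
print(flat1 == flat2)
```

Under these assumptions both documents flatten to the same field values, which is why neither the script nor the bool query can tell them apart.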
Hope it helps!

Related

How to return results from elasticsearch after a threshold match

I have two queries as follows:
The first query returns the count of all documents per domain.
The second query returns the count where a field is empty.
Later, I filter the results in my backend: a domain is kept only if its count of documents with a missing field value exceeds a specific threshold; otherwise it is ignored. Could these two queries be combined so that the threshold comparison happens in the query and only the matching results are returned?
The first query is as follows:
GET database/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"source": {
"value": "Web"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "domain_id"
}
}
}
}
The second query just applies a should filter as follows:
GET mapachitl/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"source": {
"value": "Web"
}
}
}
],
"should": [
{
"term": {
"address.city.keyword": {
"value": ""
}
}
},
{
"term": {
"address.zip.keyword": {
"value": ""
}
}
}
],
"minimum_should_match": 1
}
},
"aggs": {
"domains": {
"terms": {
"field": "domain_id"
}
}
}
}
Can I return only those domains where the ratio of documents missing a city or zip code is more than 25%? I have read about scripting, but I'm not sure how to use it here.
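One possible way to combine the two queries, sketched in Python under assumptions: the "missing city or zip" condition moves into a filter sub-aggregation under domains, and the 25% comparison is done on the response. The names combined_body, missing_address and domains_over_threshold are hypothetical, and sample_response is made-up data shaped like a terms aggregation response.

```python
# Hypothetical combined request body: the original terms aggregation plus
# a filter sub-aggregation that counts documents missing city or zip.
combined_body = {
    "size": 0,
    "query": {"bool": {"must": [{"term": {"source": {"value": "Web"}}}]}},
    "aggs": {
        "domains": {
            "terms": {"field": "domain_id"},
            "aggs": {
                "missing_address": {  # hypothetical aggregation name
                    "filter": {
                        "bool": {
                            "should": [
                                {"term": {"address.city.keyword": {"value": ""}}},
                                {"term": {"address.zip.keyword": {"value": ""}}},
                            ],
                            "minimum_should_match": 1,
                        }
                    }
                }
            },
        }
    },
}


def domains_over_threshold(response, threshold=0.25):
    """Return domain keys whose missing/total ratio exceeds the threshold."""
    selected = []
    for bucket in response["aggregations"]["domains"]["buckets"]:
        total = bucket["doc_count"]
        missing = bucket["missing_address"]["doc_count"]
        if total > 0 and missing / total > threshold:
            selected.append(bucket["key"])
    return selected


# Made-up response, for illustration only.
sample_response = {"aggregations": {"domains": {"buckets": [
    {"key": 1, "doc_count": 100, "missing_address": {"doc_count": 40}},
    {"key": 2, "doc_count": 200, "missing_address": {"doc_count": 10}},
]}}}

print(domains_over_threshold(sample_response))
```

A bucket_selector pipeline aggregation may be able to apply the same comparison inside Elasticsearch, so that only the qualifying buckets are returned; that variant is not shown or tested here.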

Elastic-search aggregate top 3 common result

My indexed data has the structure below. I want to aggregate the top 3 most repeated productProperty values, so that only those 3 appear in the aggregation result.
[
{
productProperty: "material",
productValue:[{value: wood},{value: plastic}] ,
},
{
productProperty: "material",
productValue:[{value: wood},{value: plastic}] ,
},
{
productProperty: "type",
productValue:[{value: 26A},{value: 23A}] ,
},
{
productProperty: "type",
productValue:[{value: 22B},{value: 90C}] ,
},
{
productProperty: "material",
productValue:[{value: wood},{value: plastic}] ,
},
{
productProperty: "age_rating",
productValue:[{value: 18},{value: 13}] ,
}
]
The query below aggregates everything by productProperty, but how can I get only the top 3 results out of it?
{
"query": {},
"aggs": {
"filtered_product_property": {
"filter": {
"bool": {
"must": []
}
},
"aggs": {
"aggs": {
"productProperty": {
"terms": {
"field": "productProperty"
}
}
}
}
}
}
}
You can use the size parameter in your terms aggregation.
{
"query": {},
"aggs": {
"filtered_product_property": {
"filter": {
"bool": {
"must": []
}
},
"aggs": {
"productProperty": {
"terms": {
"field": "productProperty",
"size": 3
}
}
}
}
}
}
It is important to point out that the counts returned by a terms aggregation can be approximate in some cases.
As mentioned by @Tushar, you can use the size param. According to the official ES documentation (describing sum_other_doc_count):
when there are lots of unique terms, Elasticsearch only returns the
top terms; this number is the sum of the document counts for all
buckets that are not part of the response
You can also define the order in which the buckets of the aggregation response are sorted, using the order param.
By default, the buckets are sorted by doc count in descending order.
The search query will be:
{
"aggs": {
"productProperty": {
"terms": {
"field": "productProperty.keyword",
"size": 3
}
}
}
}
And the search result would be:
"aggregations": {
"productProperty": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "material",
"doc_count": 3
},
{
"key": "type",
"doc_count": 2
},
{
"key": "age_rating",
"doc_count": 1
}
]
}
}
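As a rough mental model (not how Elasticsearch computes it across shards), a terms aggregation with size 3 and the default ordering behaves like counting values and keeping the three most frequent. A minimal Python sketch over the sample data, where top_terms is a hypothetical helper:

```python
from collections import Counter

# productProperty values of the six sample documents from the question.
docs = [
    {"productProperty": "material"},
    {"productProperty": "material"},
    {"productProperty": "type"},
    {"productProperty": "type"},
    {"productProperty": "material"},
    {"productProperty": "age_rating"},
]


def top_terms(documents, field, size):
    """Count values of `field` and return the `size` most frequent ones."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return counts.most_common(size)


result = top_terms(docs, "productProperty", 3)
print(result)
```

The counts match the doc_count values in the search result above.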

Filtering aggregation results

This question is a subquestion of this question. Posting as a separate question for attention.
Sample Docs:
{
"id":1,
"product":"p1",
"cat_ids":[1,2,3]
}
{
"id":2,
"product":"p2",
"cat_ids":[3,4,5]
}
{
"id":3,
"product":"p3",
"cat_ids":[4,5,6]
}
Ask: get the products belonging to a particular category, e.g. cat_id = 3.
Query:
GET product/_search
{
"size": 0,
"aggs": {
"cats": {
"terms": {
"field": "cat_ids",
"size": 10
},"aggs": {
"products": {
"terms": {
"field": "product.keyword",
"size": 10
}
}
}
}
}
}
Question:
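How can I filter the aggregated result for cat_id = 3 here? I tried bucket_selector as well, but it is not working.
Note: because cat_ids is multi-valued, filtering the documents first and then aggregating doesn't work; the other category values of the matching documents still produce buckets.
You can restrict the values for which buckets are created. The documentation for the terms aggregation says: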
It is possible to filter the values for which buckets will be created.
This can be done using the include and exclude parameters which are
based on regular expression strings or arrays of exact values.
Additionally, include clauses can filter using partition expressions.
Adding a working example with index data, search query, and search result
Index Data:
{
"id":1,
"product":"p1",
"cat_ids":[1,2,3]
}
{
"id":2,
"product":"p2",
"cat_ids":[3,4,5]
}
{
"id":3,
"product":"p3",
"cat_ids":[4,5,6]
}
Search Query (note the include parameter):
{
"size": 0,
"aggs": {
"cats": {
"terms": {
"field": "cat_ids",
"include": [
3
]
},
"aggs": {
"products": {
"terms": {
"field": "product.keyword",
"size": 10
}
}
}
}
}
}
Search Result:
"aggregations": {
"cats": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"products": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "p1",
"doc_count": 1
},
{
"key": "p2",
"doc_count": 1
}
]
}
}
]
}
}
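The effect of include on a multi-valued field can be illustrated with a small Python simulation. This is only a sketch of the bucketing logic, not Elasticsearch code; terms_with_include is a hypothetical helper.

```python
from collections import Counter

docs = [
    {"id": 1, "product": "p1", "cat_ids": [1, 2, 3]},
    {"id": 2, "product": "p2", "cat_ids": [3, 4, 5]},
    {"id": 3, "product": "p3", "cat_ids": [4, 5, 6]},
]


def terms_with_include(documents, field, include, sub_field):
    """Bucket documents by `field`, creating buckets only for included values."""
    allowed = set(include)
    buckets = {}
    for doc in documents:
        # A document contributes once per distinct value it holds.
        for value in set(doc.get(field, [])):
            if value not in allowed:
                continue  # no bucket is created for excluded values
            bucket = buckets.setdefault(
                value, {"doc_count": 0, "products": Counter()}
            )
            bucket["doc_count"] += 1
            bucket["products"][doc[sub_field]] += 1
    return buckets


result = terms_with_include(docs, "cat_ids", [3], "product")
print(result)
```

Documents 1 and 2 also carry categories 1, 2, 4 and 5, but only the bucket for 3 is created, which matches the search result above.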

Filtered aggregation query output needed in non-nested format

I have following query which gives the desired output, but in nested format.
{
"size": 0,
"aggs": {
"Pre_Post": {
"filters": {
"filters": {
"PRE": {
"range": {
"mydate": {
"gte": "2017-12-31||-6M",
"lte": "2017-12-31"
}
}
},
"POST": {
"range": {
"mydate": {
"gte": "2018-08-01",
"lte": "2018-08-07"
}
}
}
}
},
"aggs": {
"dimension1": {
"terms": {
"field": "myType.keyword"
},
"aggs": {
"sales": {
"sum": {
"field": "sales"
}
}
}
}
}
}
}
}
The output of the above is roughly in this format:
"PRE_POST": {
"PRE": {
"buckets": {
"dimension1": {
"key": "field1",
"buckets": {
"sales": 50
}
}
}
}
}
Is there any way to get this in a less nested format, something like the output of a composite aggregation?
The desired output would look something like:
"PRE_POST": {
"Key1": "PRE",
"dimension1": "field1",
"buckets": {
"sales": 50
}
}
I have tried a composite aggregation, but composite does not allow filters as sources.
I have tried composite with PRE_POST as a script field, but that is very slow.
I have also tried an adjacency matrix, where two filters are for PRE and POST and the others are for each dimension1 value, but this returns too much unnecessary data.
Is there any approach I am missing that returns the output in a less nested format?
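If no server-side option fits, one fallback is to flatten the nested response on the client. A minimal Python sketch, assuming the response has the usual shape of a named filters aggregation with terms and sum sub-aggregations; flatten_pre_post is a hypothetical helper and sample_response is made-up data.

```python
def flatten_pre_post(response):
    """Turn the nested filters > terms > sum response into flat rows."""
    rows = []
    periods = response["aggregations"]["Pre_Post"]["buckets"]
    for period, period_bucket in periods.items():
        for term_bucket in period_bucket["dimension1"]["buckets"]:
            rows.append({
                "Key1": period,
                "dimension1": term_bucket["key"],
                "sales": term_bucket["sales"]["value"],
            })
    return rows


# Made-up response, for illustration only.
sample_response = {"aggregations": {"Pre_Post": {"buckets": {
    "PRE": {"doc_count": 3, "dimension1": {"buckets": [
        {"key": "field1", "doc_count": 3, "sales": {"value": 50.0}},
    ]}},
    "POST": {"doc_count": 2, "dimension1": {"buckets": [
        {"key": "field1", "doc_count": 2, "sales": {"value": 20.0}},
    ]}},
}}}}

rows = flatten_pre_post(sample_response)
print(rows)
```

This does not reduce what Elasticsearch sends back; it only reshapes it into one row per PRE/POST and dimension1 combination.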

Dividing counts of two different queries in kibana

I am trying to create a Lucene expression that displays the ratio of the counts of two queries. Both queries match textual information, and both results are in the message field. I am not sure how to write this correctly. What I have tried so far, without any luck, is:
doc['message'].value/doc['message'].value
For the first query, message contains the text "404 not found".
For the second query, message contains the text "500 error".
What I want to compute is count(404 not found)/count(500 error).
I would appreciate any help.
I'm going to add the disclaimer that it would be significantly cleaner to just run two separate counts and perform the calculation on the client side like this:
GET /INDEX/_search
{
"size": 0,
"aggs": {
"types": {
"terms": {
"field": "type",
"size": 10
}
}
}
}
Which would return something like (except using your distinct keys instead of the types in my example):
"aggregations": {
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Article",
"doc_count": 881
},
{
"key": "Page",
"doc_count": 301
}
]
}
}
Using that, take your distinct counts and calculate the ratio.
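The client-side calculation could look like the following Python sketch. count_ratio is a hypothetical helper; the sample response reuses the bucket keys and counts from the example above.

```python
def count_ratio(response, agg_name, numerator_key, denominator_key):
    """Divide the doc_count of one terms bucket by that of another."""
    buckets = response["aggregations"][agg_name]["buckets"]
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    numerator = counts.get(numerator_key, 0)
    denominator = counts.get(denominator_key, 0)
    if denominator == 0:
        return None  # ratio is undefined when the second count is zero
    return numerator / denominator


sample_response = {"aggregations": {"types": {"buckets": [
    {"key": "Article", "doc_count": 881},
    {"key": "Page", "doc_count": 301},
]}}}

ratio = count_ratio(sample_response, "types", "Article", "Page")
print(ratio)
```

Returning None for a zero denominator is a design choice of this sketch; the script further down substitutes 1 for a zero count instead.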
With that said, here is the hacky way I was able to put this together in a single request:
GET /INDEX/_search
{
"size": 0,
"aggs": {
"parent_agg": {
"terms": {
"script": "'This approach is a weird hack'"
},
"aggs": {
"four_oh_fours": {
"filter": {
"term": {
"message": "404 not found"
}
},
"aggs": {
"count": {
"value_count": {
"field": "_index"
}
}
}
},
"five_hundreds": {
"filter": {
"term": {
"message": "500 error"
}
},
"aggs": {
"count": {
"value_count": {
"field": "_index"
}
}
}
},
"404s_over_500s": {
"bucket_script": {
"buckets_path": {
"four_oh_fours": "four_oh_fours.count",
"five_hundreds": "five_hundreds.count"
},
"script": "return params.four_oh_fours / (params.five_hundreds == 0 ? 1: params.five_hundreds)"
}
}
}
}
}
}
This should return an aggregate value based on the calculation within the script.
If someone can offer an approach aside from these two, I would love to see it. Hope this helps.
Edit - Same script done via "expression" type rather than painless (default). Just replace the above script value with the following:
"script": {
"inline": "four_oh_fours / (five_hundreds == 0 ? 1 : five_hundreds)",
"lang": "expression"
}
Updated the script here to accomplish the same thing via Lucene expressions
