Elasticsearch aggregations on nested inner hits - elasticsearch

I have a large amount of data in Elasticsearch. My documents have a nested field called "records" that contains a list of objects with several fields.
I want to query specific objects from the records list, so I use inner_hits in my query, but that doesn't help because the aggregation request uses size 0, so no hits are returned.
I haven't managed to make an aggregation run only over the inner_hits: the aggregation returns results for all the objects inside records, regardless of the query.
This is the query I am using:
(Each document has first_timestamp and last_timestamp fields, and each object in the records list has a timestamp field)
curl -XPOST 'localhost:9200/_msearch?pretty' -H 'Content-Type: application/json' -d'
{
"index":[
"my_index"
],
"search_type":"count",
"ignore_unavailable":true
}
{
"size":0,
"query":{
"filtered":{
"query":{
"nested":{
"path":"records",
"query":{
"term":{
"records.data.field1":"value1"
}
},
"inner_hits":{}
}
},
"filter":{
"bool":{
"must":[
{
"range":{
"first_timestamp":{
"gte":1504548296273,
"lte":1504549196273,
"format":"epoch_millis"
}
}
}
]
}
}
}
},
"aggs":{
"nested_2":{
"nested":{
"path":"records"
},
"aggs":{
"2":{
"date_histogram":{
"field":"records.timestamp",
"interval":"1s",
"min_doc_count":1,
"extended_bounds":{
"min":1504548296273,
"max":1504549196273
}
}
}
}
}
}
}'

Your query is pretty complex.
In short, here is the query you asked for:
{
"size": 0,
"aggregations": {
"nested_A": {
"nested": {
"path": "records"
},
"aggregations": {
"bool_aggregation_A": {
"filter": {
"bool": {
"must": [
{
"term": {
"records.data.field1": "value1"
}
}
]
}
},
"aggregations": {
"reverse_aggregation": {
"reverse_nested": {},
"aggregations": {
"bool_aggregation_B": {
"filter": {
"bool": {
"must": [
{
"range": {
"first_timestamp": {
"gte": 1504548296273,
"lte": 1504549196273,
"format": "epoch_millis"
}
}
}
]
}
},
"aggregations": {
"nested_B": {
"nested": {
"path": "records"
},
"aggregations": {
"my_histogram": {
"date_histogram": {
"field": "records.timestamp",
"interval": "1s",
"min_doc_count": 1,
"extended_bounds": {
"min": 1504548296273,
"max": 1504549196273
}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
Here is what each step does, by aggregation name:
size: 0 -> we are not interested in hits, only in aggregations
nested_A -> data.field1 lives under records, so we move the scope into the nested records
bool_aggregation_A -> filter the nested records by data.field1: value1
reverse_aggregation -> first_timestamp is a field of the parent document, not of the nested objects, so we step back out of records
bool_aggregation_B -> filter the parent documents by the first_timestamp range
nested_B -> step back into records to reach the timestamp field (located under records)
my_histogram -> finally, build the date histogram on the timestamp field
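The layering described in the steps above can also be assembled programmatically, from the inside out. The following is a minimal Python sketch that only builds the request body as a dict and does not talk to a cluster. The helper name `build_histogram_aggs` and its parameters are made up for illustration; the field and aggregation names are the ones from the question and the answer.

```python
import json


def build_histogram_aggs(field_value, start_ms, end_ms):
    """Build the nested -> filter -> reverse_nested -> filter -> nested chain."""
    # Innermost step: date histogram on the nested timestamp field.
    my_histogram = {
        "date_histogram": {
            "field": "records.timestamp",
            "interval": "1s",
            "min_doc_count": 1,
            "extended_bounds": {"min": start_ms, "max": end_ms},
        }
    }
    # Step back into records so records.timestamp is reachable.
    nested_b = {
        "nested": {"path": "records"},
        "aggregations": {"my_histogram": my_histogram},
    }
    # Filter parent documents by the first_timestamp range.
    bool_b = {
        "filter": {"bool": {"must": [{"range": {"first_timestamp": {
            "gte": start_ms, "lte": end_ms, "format": "epoch_millis"}}}]}},
        "aggregations": {"nested_B": nested_b},
    }
    # Leave the nested scope to reach parent-level fields.
    reverse = {
        "reverse_nested": {},
        "aggregations": {"bool_aggregation_B": bool_b},
    }
    # Filter nested records by data.field1.
    bool_a = {
        "filter": {"bool": {"must": [
            {"term": {"records.data.field1": field_value}}]}},
        "aggregations": {"reverse_aggregation": reverse},
    }
    # Outermost step: enter the nested records scope.
    nested_a = {
        "nested": {"path": "records"},
        "aggregations": {"bool_aggregation_A": bool_a},
    }
    return {"size": 0, "aggregations": {"nested_A": nested_a}}


body = build_histogram_aggs("value1", 1504548296273, 1504549196273)
print(json.dumps(body, indent=2))
```

Building from the innermost aggregation outward keeps each scope change (into records, out to the parent, back into records) visible as one named step.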

Aggregating on inner_hits is not supported by Elasticsearch. The reason is that inner_hits is a very expensive operation, and running aggregations on top of it would increase the cost of the operation dramatically.
Here is the GitHub link of the issue.
If you want to aggregate over inner_hits, you can try one of the following approaches:
Write a flexible query that returns only the hits you need from Elasticsearch and aggregate over them. Repeat it as many times as needed to collect all the hits, aggregating as you go. This approach can lead to many search queries, which is not advisable.
Let your application layer handle the aggregation logic: write an aggregation parser and run it on the response from Elasticsearch. This approach is a little better, but you have the overhead of maintaining the parser as requirements change.
I would personally recommend changing your data mapping in Elasticsearch so that you can run aggregations on it directly.
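As an illustration of the application-layer approach, here is a minimal Python sketch that buckets the inner hits of a search response into a per-second histogram. It assumes the usual response layout for nested inner hits (`hits.hits[].inner_hits.<path>.hits.hits[]._source`) and that each nested object carries an integer epoch-millisecond `timestamp`. The function name and the sample response are invented for the example. Note that inner_hits returns only 3 hits per parent by default, so its `size` would have to be raised for complete counts.

```python
from collections import Counter


def histogram_from_inner_hits(response, path="records", interval_ms=1000):
    """Count inner hits per time bucket, keyed by bucket start (epoch millis)."""
    counts = Counter()
    for hit in response.get("hits", {}).get("hits", []):
        inner = (
            hit.get("inner_hits", {})
            .get(path, {})
            .get("hits", {})
            .get("hits", [])
        )
        for record in inner:
            ts = record["_source"]["timestamp"]
            # Round down to the start of the bucket.
            counts[ts - ts % interval_ms] += 1
    return dict(sorted(counts.items()))


# Hand-written sample shaped like a search response with inner hits.
sample_response = {
    "hits": {"hits": [
        {"_id": "1", "inner_hits": {"records": {"hits": {"hits": [
            {"_source": {"timestamp": 1504548296273}},
            {"_source": {"timestamp": 1504548296900}},
        ]}}}},
        {"_id": "2", "inner_hits": {"records": {"hits": {"hits": [
            {"_source": {"timestamp": 1504548297100}},
        ]}}}},
    ]}
}

histogram = histogram_from_inner_hits(sample_response)
print(histogram)
```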

You can also try code like this:
PUT records
{
"mappings": {
"properties": {
"records": {
"type": "nested"
}
}
}
}
POST records/_doc
{
"records": [
{
"data": "test1",
"value": 1
},
{
"data": "test2",
"value": 2
}
]
}
GET records/_search
{
"size": 0,
"aggs": {
"all_nested_count": {
"nested": {
"path": "records"
},
"aggs": {
"bool_aggs": {
"filter": {
"bool": {
"must": [
{
"term": {
"records.data": "test2"
}
}
]
}
},
"aggs": {
"filtered_aggs": {
"sum": {
"field": "records.value"
}
}
}
}
}
}
}
}
Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/inner-hits.html

Related

How to search for an array of terms, in elasticsearch?

Context: I have a query that searches for a term in two fields with a wildcard, and the result should contain items that resemble the search term. But eventually I'll receive a list of search terms...
I use this query when I get only one string:
{
"query": {
"bool": {
"filter": [
{
"bool": {
"should": [
{
"wildcard": {
"shortName": "BAN*"
}
},
{
"wildcard": {
"name": "BAN*"
}
}
]
}
},
{
"range": {
"dhCot": {
"gte": "2022-04-11T00:00:00.000Z",
"lt": "2022-04-12T00:00:00.000Z"
}
}
}
]
}
},
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "dtBuy",
"interval": "1H",
"format": "yyyy-MM-dd:HH:mm:ssZ"
},
"aggs": {
"documents": {
"top_hits": {
"size": 100
}
}
}
}
}
}
But sometimes I will get an array of strings, like this: ["BANANA","APPLE","ORANGE"]
So, how do I search for items that exactly match the items within the array? Is it possible?
The object indexed in Elasticsearch is this one:
{
"name": "BANANA",
"priceDay": 1,
"priceWeek": 3,
"variation": 2,
"dataBuy":"2022-04-11T11:01:00.585Z",
"shortName": "BAN"
}
If you want to search for items that exactly match the items within the array, you can use the terms query:
{
"query": {
"terms": {
"name": ["BANANA","APPLE","ORANGE"]
}
}
}
You can include the terms query in your existing query, in either the should clause or the must clause, depending on your use case.
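As a rough sketch of that combination, the Python snippet below builds a bool query that places the terms clause next to the existing date range in the filter list. The function name `build_query` is hypothetical; the field names come from the question. One caveat: terms matches exact indexed values, so if `name` is an analyzed text field, a keyword sub-field (for example `name.keyword`, assuming the mapping has one) may be needed instead.

```python
def build_query(names, start, end):
    """Exact match on any of the given names, restricted to a date range."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Matches documents whose name equals any listed value.
                    {"terms": {"name": list(names)}},
                    {"range": {"dhCot": {"gte": start, "lt": end}}},
                ]
            }
        }
    }


query = build_query(
    ["BANANA", "APPLE", "ORANGE"],
    "2022-04-11T00:00:00.000Z",
    "2022-04-12T00:00:00.000Z",
)
print(query)
```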

Filter out terms aggregation buckets in elasticsearch after applying aggregation

Below is a snapshot of the dataset (a dash marks an empty field):
recordNo  employeeId  employeeStatus  employeeAddr
1         employeeA   Permanent       -
2         employeeA   -               ABC
3         employeeB   Contract        -
4         employeeB   -               CDE
I want to get the list of employees along with employeeStatus and employeeAddr.
So I am using terms aggregation on employeeId and then using sub-aggregations of employeeStatus and employeeAddr to get these details.
Below query returns the results correctly.
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
}
}
}
}
}
Now I want only the employees which are in Permanent status. So I am applying filter aggregation.
{
"aggregations": {
"filter_Employee_employeeID": {
"filter": {
"bool": {
"must": [
{
"match": {
"employeeStatus": {"query": "Permanent"}
}
}
]
}
},
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
}
}
}
}
}
}
}
Now the problem is that the employeeAddr aggregation returns no buckets for employeeA because record 2 gets filtered out before the aggregation is done.
Assuming that I cannot modify the data set and I want to achieve the result with a single elastic query, how can I do it?
I checked the Bucket Selector pipeline aggregation but it only works for metric aggregations.
Is there a way to filter out term buckets after the aggregation is applied?
If I understood correctly, you want to preserve the aggregations even when a filter is applied. To achieve that, try using the post_filter clause.
You can check the docs here.
The clause is applied "outside" the aggregations: it filters the search hits after the aggregations have been computed. Using your example, it should look like this:
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeID"
},
"aggregations": {
"employeeStatus": {
"terms": {
"field": "employeeStatus"
}
},
"employeeAddr": {
"terms": {
"field": "employeeAddr"
}
}
}
}
},
"post_filter": {
"bool": {
"must": [
{
"match": {
"employeeStatus": {
"query": "Permanent"
}
}
}
]
}
}
}
I tested a combination of the include option of the terms aggregation and a bucket_selector on the bucket count, and it gives the desired result.
Filtering term values is documented here.
The bucket selector using a count is documented here.
The subtlety is that, yes, bucket_selector needs numeric values, but you can also reference the special paths that Elasticsearch exposes, such as _bucket_count:
{
"aggregations": {
"Employee": {
"terms": {
"field": "employeeId.keyword"
},
"aggregations": {
"employeeStatus": {
"terms": {"field": "employeeStatus", "include": "Permanent"}
},
"employeeAddr": {
"terms": {"field": "employeeAddr"}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "employeeStatus._bucket_count"
},
"script": {
"source": "params.count != 0"
}
}
}
}
}
}
}
I tested this on 7.10 and it worked, returning only employeeA, with the address included.
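To make the intent of that query easier to follow, here is a small Python simulation of the same idea on the four sample records: group by employee first, then drop the groups that contain no Permanent status. This is plain in-memory Python, not Elasticsearch code, and the function name is made up for the illustration.

```python
from collections import defaultdict

# The four sample records from the question; missing fields are omitted.
records = [
    {"recordNo": 1, "employeeId": "employeeA", "employeeStatus": "Permanent"},
    {"recordNo": 2, "employeeId": "employeeA", "employeeAddr": "ABC"},
    {"recordNo": 3, "employeeId": "employeeB", "employeeStatus": "Contract"},
    {"recordNo": 4, "employeeId": "employeeB", "employeeAddr": "CDE"},
]


def permanent_employees(rows):
    """Bucket rows by employee, then keep buckets containing 'Permanent'."""
    buckets = defaultdict(lambda: {"statuses": set(), "addrs": set()})
    for row in rows:
        bucket = buckets[row["employeeId"]]
        if "employeeStatus" in row:
            bucket["statuses"].add(row["employeeStatus"])
        if "employeeAddr" in row:
            bucket["addrs"].add(row["employeeAddr"])
    # The selection happens after bucketing, so addresses stored on
    # other records of the same employee are preserved.
    return {
        employee: sorted(bucket["addrs"])
        for employee, bucket in buckets.items()
        if "Permanent" in bucket["statuses"]
    }


result = permanent_employees(records)
print(result)
```

Because the selection runs on whole buckets, record 2 still contributes its address even though it has no status of its own.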

Elasticsearch scoped aggregation not desired results

I have the following query, but the aggregation doesn't seem to act on the query results.
The query returns 3 results, but there are 10 items in the aggregation. It looks like the aggregation is acting on all queried results.
Basically, how do I get the aggregation to take the given query as the input?
{
"query": {
"filtered": {
"filter": {
"and": [
{
"geo_distance": {
"coordinates": [
-79.3931,
43.6709
],
"distance": "15km"
}
},
{
"term": {
"user.type": "2"
}
}
]
},
"query": {
"match": {
"user.shoes": "314"
}
}
}
},
"aggs": {
"dedup": {
"terms": { "field": "user.id" },
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1
}
}
}
}
}
}
As it turns out, I was expecting the aggregation to act only on the paginated results returned by the query, and that's incorrect.
The aggregation takes as input all results of the query, not just the returned page.
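The difference can be illustrated with a tiny in-memory simulation in Python. The document list and the page size of 3 are invented to mirror the numbers in the question; nothing here calls Elasticsearch.

```python
from collections import Counter

# Pretend these 25 documents all match the query; 10 distinct users.
matching_docs = [{"doc": i, "user_id": i % 10} for i in range(25)]

# The "hits" section only contains one page of the matches.
page = matching_docs[:3]

# A terms-style aggregation sees every matching document, not the page.
agg_over_all = Counter(doc["user_id"] for doc in matching_docs)
agg_over_page = Counter(doc["user_id"] for doc in page)

print(len(page), len(agg_over_all), len(agg_over_page))
```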

Aggregation, Query Context and filter Context not working in Elasticsearch 5.1

I am facing an issue migrating from Elasticsearch 1.5 to 5.1.
The following is my Elasticsearch 1.5 query:
{
"_source":["_id","spotlight"],
"query":{
"filtered":{
"filter":{
"and":[
{"term":{"gender":"female"}},
{"range":{"lastlogindate":{"gte":"2016-10-19 12:39:57"}}}
]
}
}
},
"filter":{
"and":[
{"term":{"maritalstatus":"1"}}
]
},
"sort":[{"member2_dummy7":{"order":"desc"}}],
"size":"0",
"aggs": {
"maritalstatus": {
"filter": {},
"aggs" : {
"filtered_maritalstatus": {"terms":{"field":"maritalstatus","size":5000}}
}
}
}
}
This query gives me the correct doc_count in the aggregations. The doc_count is calculated over the result set returned by the query context and ignores the filter context.
I have written the same query for Elasticsearch 5.1:
{
"_source":["_id","spotlight"],
"query":{
"bool":{
"must":[
{"term":{"gender":"female"}},
{"range":{"lastlogindate":{"gte":"2016-10-19 12:39:57"}}}
],
"filter":{
"bool":{
"must":[
{"term":{"maritalstatus":"1"}}
]
}
}
}
},
"sort":[{"member2_dummy7":{"order":"DESC"}}],
"size":"0",
"aggs": {
"maritalstatus": {
"filter": {},
"aggs" : {
"filtered_maritalstatus": {"terms":{"field":"maritalstatus","size":5000}}
}
}
}
}
But in Elasticsearch 5.1, it returns the wrong doc_count in the aggregation. I think it is applying the filter in the query context, and hence it returns the wrong doc_count. Can someone tell me the correct way to separate query and filter in Elasticsearch 5.1?
Your 1.5 query uses a top-level filter, which acts as a post_filter, and you have dropped it in your 5.1 query.
The equivalent query in ES 5.1 is the following (filtered/filter is simply replaced by bool/filter, and the top-level filter is renamed to post_filter):
{
"_source": [
"_id",
"spotlight"
],
"query": {
"bool": {
"filter": [
{
"term": {
"gender": "female"
}
},
{
"range": {
"lastlogindate": {
"gte": "2016-10-19 12:39:57"
}
}
}
]
}
},
"post_filter": {
"term": {
"maritalstatus": "1"
}
},
"sort": [
{
"member2_dummy7": {
"order": "desc"
}
}
],
"size": "0",
"aggs": {
"maritalstatus": {
"filter": {},
"aggs": {
"filtered_maritalstatus": {
"terms": {
"field": "maritalstatus",
"size": 5000
}
}
}
}
}
}
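The two renames described in this answer can be sketched as a small Python transformation on the request body. This is an illustrative helper, not an official migration tool: the names `migrate_body` and `_unwrap_and` are made up, and it only handles the shapes that appear in the question (a `filtered` query and `and` filters written as plain lists).

```python
import copy


def _unwrap_and(clause):
    """Turn an old-style {"and": [...]} filter into a list of clauses."""
    if isinstance(clause, dict) and isinstance(clause.get("and"), list):
        return list(clause["and"])
    return [clause]


def migrate_body(old_body):
    """Rewrite filtered -> bool and top-level filter -> post_filter."""
    body = copy.deepcopy(old_body)

    # The top-level filter becomes post_filter.
    if "filter" in body:
        clauses = _unwrap_and(body.pop("filter"))
        if len(clauses) == 1:
            body["post_filter"] = clauses[0]
        else:
            body["post_filter"] = {"bool": {"filter": clauses}}

    # The filtered query becomes a bool query with must/filter clauses.
    filtered = body.get("query", {}).get("filtered")
    if filtered is not None:
        bool_query = {}
        if "query" in filtered:
            bool_query["must"] = [filtered["query"]]
        if "filter" in filtered:
            bool_query["filter"] = _unwrap_and(filtered["filter"])
        body["query"] = {"bool": bool_query}
    return body


old_body = {
    "query": {"filtered": {"filter": {"and": [
        {"term": {"gender": "female"}},
        {"range": {"lastlogindate": {"gte": "2016-10-19 12:39:57"}}},
    ]}}},
    "filter": {"and": [{"term": {"maritalstatus": "1"}}]},
    "size": "0",
}

new_body = migrate_body(old_body)
print(new_body)
```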

Terrible has_child query performance

The following query has terrible performance.
I am 100% sure it is the has_child: the query without it runs in under 300ms; with it, it takes 9 seconds.
Is there some better way to use the has_child query? It seems like I could query parents, and then children by id and then join client side to do the has child check faster than the ES database engine is doing it...
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "es"
}
}
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}
Cluster info:
CPU and memory usage is low. It is an AWS ES Service cluster (v1.5.2) with many small documents, and since the version AWS is running is old, doc values aren't on by default. Not sure if that is helping or hurting.
Since "stage" is not analyzed (based on your comment) and, therefore, you are not interested in scoring the documents that match on that field, you might see slight performance gains by using the has_child filter instead of the has_child query, and a term filter instead of a term query.
In the documentation for has_child, you'll notice:
The has_child filter also accepts a filter instead of a query:
The main performance benefits of using a filter come from the fact that Elasticsearch can skip the scoring phase of the query. Also, filters can be cached which should improve the performance of future searches that use the same filters. Queries, on the other hand, cannot be cached.
Try this instead:
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "es"
}
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}
I bit the bullet and just performed the parent:child join in my application. Instead of waiting 7 seconds for the has_child query, I fire off two consecutive term queries and do some post processing: 200ms.
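A minimal Python sketch of such a client-side join is shown below. It assumes the child documents for each stage have already been fetched (for example with one term query per stage) and that each child exposes the ID of its parent; the key names `parent_id` and `stage`, the function name, and the sample data are all invented for the illustration.

```python
def parents_with_all_stages(children, required_stages):
    """Return parent IDs that have at least one child in every stage."""
    parents_by_stage = {stage: set() for stage in required_stages}
    for child in children:
        stage = child["stage"]
        if stage in parents_by_stage:
            parents_by_stage[stage].add(child["parent_id"])
    if not parents_by_stage:
        return set()
    # Equivalent to two has_child clauses combined with "must".
    return set.intersection(*parents_by_stage.values())


# Hand-written sample of child "status" documents.
children = [
    {"parent_id": "p1", "stage": "s3"},
    {"parent_id": "p1", "stage": "es"},
    {"parent_id": "p2", "stage": "s3"},
    {"parent_id": "p3", "stage": "es"},
]

matching_parents = parents_with_all_stages(children, ["s3", "es"])
print(matching_parents)
```

The resulting ID set can then be used to filter the parent documents returned by the remaining term and range filters.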
