Single query for terms aggregation with or without nested objects - elasticsearch

I want to get an aggregation of all the unique values for a field.
I found that this query works well for a "simple" field with no nested type in the JSON path.
GET /_search
{
  "aggs": {
    "myResult": {
      "terms": { "field": "school.student.name" }
    }
  },
  "size": 0
}
However, when one of the objects in the path is of nested type, I have to make a different request. Let's say student is nested, just for the example; then I change the query like this:
GET /_search
{
  "aggs": {
    "myResult": {
      "nested": {
        "path": "school.student"
      },
      "aggs": {
        "myResult2": {
          "terms": { "field": "school.student.name" }
        }
      }
    }
  },
  "size": 0
}
However, in my app I only have the JSON path, such as "school.student.name"; I do not know whether the path contains a nested object, or which object is the nested one.
I did not find any way to make a single request that handles both cases, and I ended up with a configuration file in which I have to explicitly list all the nested paths in my Elasticsearch database in order to build the correct request. With hundreds of fields, this is not maintainable.
Does Elasticsearch really not implement such a simple and basic feature?
I just want to get all the unique values corresponding to a JSON path in my Elasticsearch database.
Would you have a better solution than the config file?

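No need for a configuration file. You should retrieve the mapping directly as it is currently stored in ES (for example with GET /<index>/_mapping) and parse it; that way you always know which objects in a path are nested and can build the right request.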
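As a rough illustration of the mapping-based approach, here is a minimal Python sketch. It assumes you have already fetched the mapping and extracted the "properties" dictionary of the index; the function names (find_nested_paths, build_terms_aggregation) and the sample mapping in the usage note are hypothetical, not part of any Elasticsearch client. It walks the JSON path, collects every prefix whose type is "nested", and wraps the terms aggregation in one nested aggregation per prefix.

```python
from typing import Any, Dict, List


def find_nested_paths(mapping_properties: Dict[str, Any], field_path: str) -> List[str]:
    """Return every prefix of field_path mapped as "nested", outermost first."""
    nested_paths: List[str] = []
    properties = mapping_properties
    walked: List[str] = []
    for part in field_path.split("."):
        node = properties.get(part)
        if node is None:
            break  # not an object property (e.g. a multi-field suffix); stop walking
        walked.append(part)
        if node.get("type") == "nested":
            nested_paths.append(".".join(walked))
        properties = node.get("properties", {})
    return nested_paths


def build_terms_aggregation(
    mapping_properties: Dict[str, Any], field_path: str, name: str = "myResult"
) -> Dict[str, Any]:
    """Build a search body with a terms aggregation, nested-wrapped where needed."""
    agg: Dict[str, Any] = {"terms": {"field": field_path}}
    # Wrap from the innermost nested path outwards.
    for path in reversed(find_nested_paths(mapping_properties, field_path)):
        agg = {"nested": {"path": path}, "aggs": {name: agg}}
    return {"aggs": {name: agg}, "size": 0}
```

With a mapping where school.student is nested, build_terms_aggregation(props, "school.student.name") produces the second request from the question; with a plain object mapping it produces the first one. The bucket list then sits one level deeper in the response for each nested wrapper, so the code reading the result needs the same list of nested paths.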

Related

How to correctly denormalize one-to-many indexes coming from multiple sources

How can I restructure below elastic indexes to be able to search for registrations that had certain mailing events?
In our application we have the Profile entity which can have one to multiple Registration entities.
The registrations index is used in the majority of searches and contains the data we want to return.
Then we have multiple *Events indexes that contain events that relate to profiles.
A simplified version would look like this:
Registrations
- RegistrationId
- ProfileId
- Location
MailEvents
- ProfileId
- Template
- Actions
A simplified search might be: all the registrations in a certain location with any mailevent action for templates starting with "Solar".
Joining as in a classical RDBMS is an anti-pattern in Elasticsearch.
We are considering de-normalizing by adding all the various events for profiles to the registrations index. This will result in an explosion of data in the registrations index.
Nested objects are also bad for searching, so we should somehow make them into arrays. But how?
We have hundreds of rows in the events for every related row in registrations. The change rate on the event indexes is much higher than the one on the registrations index.
We are considering doing two requests: one across all the *Events indexes, gathering all the ProfileIds and de-duplicating them, then one against the registrations index using the result of the first.
It feels wrong and introduces complicated edge cases where there are more results than the maximum number of returned rows in the first request or the maximum number of terms values in the second.
By searching around I see many people struggling with this and looking for a way to do join queries.
It feels like de-normalizing is the way to go, but what would be the recommended approach?
What other approaches am I missing?
One approach to consider is Elasticsearch's parent-child relationship (the join field type), which keeps the events as separate documents that can be updated independently of their parent. You would make each Registration document the parent and each MailEvent document its child. Note that parent and child documents must live in the same index and be routed to the same shard, so the mail events would have to be indexed into the registrations index rather than kept in a separate one.
You could then use the has_child query to find all Registration documents that have a child matching certain MailEvent criteria. For example, to find all Registrations with a MailEvent action for templates starting with "Solar", you could write a query like this:
GET /registrations/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Location": "some_location"
          }
        },
        {
          "has_child": {
            "type": "mailevents",
            "query": {
              "bool": {
                "must": [
                  {
                    "prefix": {
                      "Template": "Solar"
                    }
                  },
                  {
                    "exists": {
                      "field": "Actions"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
This approach keeps the fast-changing events out of the registration documents while still answering the question in a single request, which avoids the complexities of multiple requests and their edge cases. The trade-off is that has_child queries are slower than queries against fully denormalized documents.
Another approach is to use Elasticsearch's aggregations. Here you perform a single search on the registrations index, filtered by the desired location, then group by ProfileId and use a nested aggregation to retrieve the relevant MailEvents data for each profile. This assumes the mail events have been stored in each registration document as a nested field named MailEvents.
Here's an example query that performs this aggregation:
GET /registrations/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Location": "some_location"
          }
        }
      ]
    }
  },
  "aggs": {
    "profiles": {
      "terms": {
        "field": "ProfileId"
      },
      "aggs": {
        "mail_events": {
          "nested": {
            "path": "MailEvents"
          },
          "aggs": {
            "filtered_mail_events": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "prefix": {
                        "MailEvents.Template": "Solar"
                      }
                    },
                    {
                      "exists": {
                        "field": "MailEvents.Actions"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "actions": {
                  "terms": {
                    "field": "MailEvents.Actions"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
This query will return the Registrations documents that match the desired location, and also provide aggregated information about the related MailEvents data. You can further manipulate the aggregated data to get the information that you need.
Note that this approach can be more complex than the parent-child relationship approach and may have performance implications if your data is large and complex. However, it may be a good solution if you need to perform complex aggregations on the MailEvents data.
As far as I know, Elasticsearch aggregations might be another way to do this. You can run a search across multiple indices, aggregate the list of ProfileIds from MailEvents, and use them to filter Registrations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
https://discuss.elastic.co/t/aggregation-across-multiple-indices/271350
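To make the two-request idea more concrete, here is a hedged Python sketch that only builds the request bodies; it does not talk to a cluster. It assumes the field names from the simplified schema in the question (ProfileId, Template, Actions, Location) and uses a composite aggregation so the first step can be paged with its "after" key instead of hitting a bucket limit. The function names and the default page size are made up for illustration.

```python
from typing import Any, Dict, Iterator, List, Optional


def profile_ids_request(
    template_prefix: str,
    after_key: Optional[Dict[str, Any]] = None,
    page_size: int = 1000,
) -> Dict[str, Any]:
    """Step 1: one page of unique ProfileIds from the mail events index."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"profile": {"terms": {"field": "ProfileId"}}}],
    }
    if after_key is not None:
        # Resume after the "after_key" returned with the previous page.
        composite["after"] = after_key
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"prefix": {"Template": template_prefix}},
                    {"exists": {"field": "Actions"}},
                ]
            }
        },
        "aggs": {"profiles": {"composite": composite}},
    }


def chunked(ids: List[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Split the collected ids so each terms query stays under the terms limit."""
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


def registrations_request(location: str, profile_ids: List[Any]) -> Dict[str, Any]:
    """Step 2: registrations in a location for one chunk of ProfileIds."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"Location": location}},
                    {"terms": {"ProfileId": profile_ids}},
                ]
            }
        }
    }
```

You would keep sending profile_ids_request with the last "after_key" until a page comes back empty, then send one registrations_request per chunk. The chunk size needs to stay below the index's terms limit (index.max_terms_count, 65536 by default as far as I know). This removes the truncation edge cases but not the cost of two round trips.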

Performance questions regarding inner_hits for nested fields?

Below are the specifications of my index:
Around 600-700 million documents
_source is disabled
store=true is set for every field in the index.
ES version 7.10.2
When a search is performed, instead of returning all the fields or the whole document, the user can ask for specific fields (those for which store is set to true).
But nested fields are not returned as part of the stored fields, as mentioned in the ES docs:
On its own, stored_fields cannot be used to load fields in nested
objects — if a field contains a nested object in its path, then no
data will be returned for that stored field. To access nested fields,
stored_fields must be used within an inner_hits block.
Query:
POST order/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "_id": "009"
          }
        },
        {
          "nested": {
            "path": "OrderStatus",
            "query": {
              "match_all": {}
            },
            "inner_hits": {
              "stored_fields": ["OrderStatus.Status"]
            }
          }
        }
      ]
    }
  }
}
But my main concern is the performance cost of such nested inner hits queries.
For each hit, a nested inner hits query will be made; if my search matches 20 million records, it will run an inner hits query for each of those 20 million. Will that not degrade performance?
I have gone through a number of articles on this, but most of them cover older versions. Here is one of the discussions:
https://github.com/elastic/elasticsearch/issues/14229
Can anyone please help me with this?

Aggregate over top hits ElasticSearch

My documents are structured in the following way:
{
"chefInfo": {
"id": int,
"employed": String
... Some more recipe information ...
}
"recipe": {
... Some recipe information ...
}
}
If a chef has multiple recipes, the nested chefInfo block will be identical in each document. My problem is that I want to aggregate on a field in the chefInfo part of the document, but the aggregation does not take into account the fact that the chefInfo block is duplicated.
So, if the chef with the id of 1 is on 5 recipes and I am aggregating on the employed field, this particular chef will account for 5 of the counts in the aggregation, whereas I want each chef to be counted only once.
I thought about doing a top_hits aggregation on the chef id and then a sub-aggregation over all of the buckets, but I can't work out how to do the counts over the results of all the buckets.
Is what I want to do possible?
For Elasticsearch, every document is unique in itself. In your case you want to define uniqueness based on a different field, here chefInfo.id. To get a unique count based on this field, you have to use the cardinality aggregation.
You can apply the aggregation as below:
{
  "aggs": {
    "employed": {
      "nested": {
        "path": "chefInfo"
      },
      "aggs": {
        "employed": {
          "terms": {
            "field": "chefInfo.employed.keyword"
          },
          "aggs": {
            "employed_unique": {
              "cardinality": {
                "field": "chefInfo.id"
              }
            }
          }
        }
      }
    }
  }
}
In the result, employed_unique gives you the expected count for each employed bucket.
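For reading the result, a small Python sketch may help. It assumes the response has the usual shape for the aggregation above (outer nested aggregation "employed", inner terms aggregation "employed", cardinality sub-aggregation "employed_unique"); the helper name and the sample response in the usage note are hypothetical.

```python
from typing import Any, Dict


def unique_chefs_per_bucket(response: Dict[str, Any]) -> Dict[str, int]:
    """Map each 'employed' value to the number of distinct chefInfo.id values."""
    # The outer "employed" is the nested aggregation, the inner one the terms aggregation.
    buckets = response["aggregations"]["employed"]["employed"]["buckets"]
    return {bucket["key"]: bucket["employed_unique"]["value"] for bucket in buckets}
```

For a made-up response in which the "yes" bucket has doc_count 5 but employed_unique.value 1, the helper returns {"yes": 1}: the five recipe documents of the same chef are counted once. Keep in mind that the cardinality aggregation returns an approximate count, which can deviate slightly for fields with very many distinct values.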

Constructing a NEST/ElasticSearch query with nested properties

I'm querying an ElasticSearch database (the Danish CVR registry) using NEST in C#. I'm trying to formulate a query that will query this scheme:
relations: [
{
participant: {
key: 123123
},
organisations: [
{
organisationName: {
name: "some string",
period: {
from: "SOME DATE"
to: "SOMEDATE OR NULL"
}
},
... more of similar objects ..
}
]
},
.. more of similar objects ..
]
My problem here is that I need to find documents that have a certain participant.key value and, at the same time, a specific organisations.organisationName.name and a missing or null value in organisations.organisationName.period.to.
I know I need to use a nested query to get documents that have both a null value in the to field and a certain name in the name field, but on top of that I also need the specific key in the participant.key field, and this is where I'm having trouble. Note that all 3 fields that I'm checking must be within the same relations object, and the to and name fields must be within the same organisationName object.
The query without the key part as a JSON query is this:
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "relations.organisations.organisationName",
            "score_mode": "max",
            "query": {
              "bool": {
                "must": [
                  { "match": { "relations.organisations.organisationName.name": "EJERREGISTER" } },
                  {
                    "filtered": {
                      "filter": {
                        "missing": { "field": "relations.organisations.organisationName.period.to" }
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
I'm hoping someone out there is adept at writing these queries in the NEST Query DSL. I could also work from a pure ElasticSearch JSON query, but the .NET equivalent would be my preferred option :)
Thanks in advance!
After some experimentation I came to the conclusion that the right answer to my problem would be a nested query on relations that (1) checks the key and (2) contains an inner nested query that handles the other conditions on the organisations.organisationName object.
I couldn't quite verify this, however, because the database I'm querying does not have the relations object marked as nested (and I can't change that, since it's a government database).
My workaround was to retrieve all relations related to my keys and then filter out the remaining objects in memory, as this wasn't too much overhead in my scenario.
Edit: as a follow-up, the external database I was using added the nested mapping, and the query worked as explained above.
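For anyone who wants the shape of that query spelled out, here is a sketch in Python that builds the JSON body (the asker preferred NEST, but the structure carries over). It assumes that both relations and relations.organisations.organisationName are mapped as nested, which the thread does not fully confirm, and it expresses "missing" as must_not plus exists, since the missing and filtered constructs from the question are not available in recent Elasticsearch versions. The function name is hypothetical.

```python
from typing import Any, Dict


def build_relation_query(participant_key: int, organisation_name: str) -> Dict[str, Any]:
    """Nested query on relations that checks the key and nests the name/period check."""
    name_path = "relations.organisations.organisationName"
    # Inner nested query: name matches and period.to is absent, in the same organisationName object.
    inner = {
        "nested": {
            "path": name_path,
            "query": {
                "bool": {
                    "must": [{"match": {name_path + ".name": organisation_name}}],
                    "must_not": [{"exists": {"field": name_path + ".period.to"}}],
                }
            },
        }
    }
    # Outer nested query: the key and the inner condition must hold in the same relations object.
    return {
        "query": {
            "nested": {
                "path": "relations",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"relations.participant.key": participant_key}},
                            inner,
                        ]
                    }
                },
            }
        }
    }
```

Calling build_relation_query(123123, "EJERREGISTER") gives a body where the key check and the organisation check sit inside the same outer nested clause, which is what ties all three fields to a single relations entry.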

Elasticsearch query with nested aggregations causing out of memory

I have Elasticsearch installed with 16 GB of memory. I started using aggregations, but ran into a "java.lang.OutOfMemoryError: Java heap space" error when I attempted to issue the following query:
POST /test-index-syslog3/type-syslog/_search
{
  "query": {
    "query_string": {
      "default_field": "DstCountry",
      "query": "CN"
    }
  },
  "aggs": {
    "whatever": {
      "terms": {
        "field": "SrcIP"
      },
      "aggs": {
        "destination_ip": {
          "terms": {
            "field": "DstIP"
          },
          "aggs": {
            "port": {
              "terms": {
                "field": "DstPort"
              }
            }
          }
        }
      }
    }
  }
}
The query_string itself only returns 1266 hits so I'm a bit confused by the OOM error.
Am I using aggregations incorrectly? If not, what can I do to troubleshoot this issue?
Thanks!
You are loading the entire SrcIP, DstIP, and DstPort fields into memory in order to aggregate on them. This is because Elasticsearch un-inverts the entire field to be able to rapidly look up a document's value for a field given its ID.
If you're going to be aggregating mostly on a very small subset of the data, you should look into using doc values (doc_values). A document's value is then stored on disk in a way that makes it easy to look up given the document's ID. There's a bit more overhead to it, but that way you leave it to the operating system's file system cache to keep the relevant pages in memory, instead of having to load the entire field.
I'm not sure about the mapping, of course, but judging by the value, the field DstCountry could be not_analyzed. Then you could replace the query with a filter within the aggregation. Maybe that helps.
Also check whether the fields you use in your aggregation are not_analyzed.
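As a rough sketch of what such a mapping could look like, the Python helper below builds a mapping body for the old string field type with "index": "not_analyzed" and "doc_values": true, which is how this was expressed in the Elasticsearch versions that still had document types. Treating all three fields as strings is an assumption made for illustration (the real index may well use ip or integer types), and the helper name is made up.

```python
from typing import Any, Dict, List


def not_analyzed_doc_values_mapping(doc_type: str, field_names: List[str]) -> Dict[str, Any]:
    """Build a mapping body that stores the given string fields as doc values."""
    properties = {
        name: {
            "type": "string",          # assumption: old-style string fields
            "index": "not_analyzed",   # doc values need a non-analyzed field
            "doc_values": True,        # keep values on disk instead of in fielddata
        }
        for name in field_names
    }
    return {doc_type: {"properties": properties}}
```

For example, not_analyzed_doc_values_mapping("type-syslog", ["SrcIP", "DstIP", "DstPort"]) gives a body you could use when creating a new index. An existing field cannot be switched to doc values in place, so the data would have to be reindexed.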

Resources