Performance queries regarding inner_hits for nested field? - elasticsearch

Below are the specifications of my index:
Around 600-700 million documents
_source is disabled
Every field in the index is mapped with store=true.
ES version 7.10.2
When a search is performed, instead of returning all fields or the whole document, the user can ask for specific fields (those mapped with store=true).
However, nested fields are not returned as part of stored_fields, as mentioned in the ES docs:
On its own, stored_fields cannot be used to load fields in nested
objects — if a field contains a nested object in its path, then no
data will be returned for that stored field. To access nested fields,
stored_fields must be used within an inner_hits block.
Query:
POST order/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "_id": "009"
          }
        },
        {
          "nested": {
            "path": "OrderStatus",
            "query": {
              "match_all": {}
            },
            "inner_hits": {
              "stored_fields": ["OrderStatus.Status"]
            }
          }
        }
      ]
    }
  }
}
My main concern is the performance cost of such nested inner_hits queries.
As I understand it, a nested inner hit query is made for each hit. If my search matches 20 million records, will an inner hit query run for each of those 20 million, and will that not degrade performance?
I have gone through a number of articles on this, but most of them cover older versions. Here is one of the discussions:
https://github.com/elastic/elasticsearch/issues/14229
Can anyone please help me with this?
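One way to reason about the cost, offered as a hedged sketch rather than a measured answer: as far as I understand, inner hits are resolved during the fetch phase, so they are loaded only for the top-level hits that are actually returned (bounded by the request's size, default 10), not for every document that matches. The helper below is hypothetical; it only builds the request body as a plain Python dict, moves both clauses into filter context (an assumption that scoring is not needed here), and makes the two size limits explicit.

```python
from typing import Any, Dict, List


def build_inner_hits_search(
    doc_id: str,
    nested_path: str,
    stored_fields: List[str],
    top_level_size: int = 10,
    inner_hits_size: int = 3,
) -> Dict[str, Any]:
    """Build a search body that loads stored nested fields via inner_hits.

    top_level_size bounds how many parent hits are fetched per request;
    inner_hits_size bounds how many nested hits are loaded per parent hit.
    """
    return {
        "size": top_level_size,
        "query": {
            "bool": {
                # Filter context: no scoring is computed for these clauses.
                "filter": [
                    {"term": {"_id": doc_id}},
                    {
                        "nested": {
                            "path": nested_path,
                            "query": {"match_all": {}},
                            "inner_hits": {
                                "size": inner_hits_size,
                                "stored_fields": stored_fields,
                            },
                        }
                    },
                ]
            }
        },
    }


body = build_inner_hits_search("009", "OrderStatus", ["OrderStatus.Status"])
```

Under that assumption, a query matching 20 million documents with size 10 would load inner hits for 10 parents per page, so the cost grows with the page size rather than with the total match count.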

Related

How to correctly denormalize one-to-many indexes coming from multiple sources

How can I restructure below elastic indexes to be able to search for registrations that had certain mailing events?
In our application we have the Profile entity which can have one to multiple Registration entities.
The registrations index is used in the majority of searches and contains the data we want to return.
Then we have multiple *Events indexes that contain events that relate to profiles.
A simplified version would look like this:
Registrations
- RegistrationId
- ProfileId
- Location
MailEvents
- ProfileId
- Template
- Actions
A simplified search might be: all the registrations in a certain location with any mailevent action for templates starting with "Solar".
Joining as in a classical relational database is an anti-pattern in Elasticsearch.
We are considering de-normalizing by adding all the various events for profiles to the registrations index. This will result in an explosion of data in the registrations index.
Nested objects are also bad for searching, so we should somehow make them into arrays. But how?
We have hundreds of rows in the events indexes for every related row in registrations. The change rate on the event indexes is far higher than on the registrations index.
We are also considering doing two requests: one against all the *Events indexes to gather and deduplicate the ProfileIds, then one against the registrations index filtered by the result of the first.
It feels wrong and introduces complicated edge cases where there are more results than the maximum number of rows returned by the first request or the maximum number of terms values allowed in the second.
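The second half of that two-request idea can be sketched as follows. This is a minimal illustration, not a recommendation: the function names are hypothetical, and the default batch size assumes the index.max_terms_count setting is at its default of 65536. It deduplicates the collected ProfileIds and splits them into several terms queries so that no single request exceeds the limit.

```python
from typing import Any, Dict, Iterable, Iterator, List


def unique_in_order(ids: Iterable[str]) -> List[str]:
    """Drop duplicate IDs while keeping first-seen order."""
    return list(dict.fromkeys(ids))


def terms_query_batches(
    field: str, ids: Iterable[str], max_terms: int = 65536
) -> Iterator[Dict[str, Any]]:
    """Yield one search body per batch of at most max_terms unique IDs."""
    unique = unique_in_order(ids)
    for start in range(0, len(unique), max_terms):
        batch = unique[start:start + max_terms]
        yield {"query": {"bool": {"filter": [{"terms": {field: batch}}]}}}


# Tiny batch size only to make the splitting visible.
batches = list(terms_query_batches("ProfileId", ["p1", "p2", "p1", "p3"], max_terms=2))
```

The results of the individual batches still have to be merged and paged on the client, which is exactly the edge-case handling described above.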
By searching around I see many people struggling with this and looking for a way to do join queries.
It feels like de-normalizing is the way to go, but what would be the recommended approach?
What other approaches am I missing?
One approach to consider is Elasticsearch's parent-child relationship (the join field type), which models the one-to-many link without copying event data into every registration. With parent-child, Registrations documents are the parents and MailEvents documents are the children. Note that parents and children must live in the same index and each child must be routed to its parent's shard; the children remain separate documents, so events can be updated without reindexing the registration.
Additionally, you could use the has_child query to find all Registrations documents that have children matching certain MailEvent criteria. For example, to find all Registrations with a MailEvent action for templates starting with "Solar", you could write a query like this:
GET /registrations/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Location": "some_location"
          }
        },
        {
          "has_child": {
            "type": "mailevents",
            "query": {
              "bool": {
                "must": [
                  {
                    "prefix": {
                      "Template": "Solar"
                    }
                  },
                  {
                    "exists": {
                      "field": "Actions"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
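This approach keeps events and registrations as separate documents while still letting you filter registrations by event criteria in a single request, avoiding the complexity of multiple requests and their edge cases. The trade-off is that has_child queries are slower than queries on a flat, denormalized document.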
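For the has_child query above to work, the index needs a join field. The sketch below shows what that could look like, written as plain Python dicts; the join field name relation and the parent relation name registration are assumptions for illustration (only mailevents appears in the query above). A child document names its parent in the join field and has to be indexed with routing equal to the parent ID so that both land on the same shard.

```python
from typing import Any, Dict, List, Tuple


def join_mapping(
    parent: str = "registration",
    child: str = "mailevents",
    join_field: str = "relation",
) -> Dict[str, Any]:
    """Index body declaring a parent -> child relation via a join field."""
    return {
        "mappings": {
            "properties": {
                join_field: {"type": "join", "relations": {parent: child}}
            }
        }
    }


def child_event(
    registration_id: str,
    template: str,
    actions: List[str],
    child: str = "mailevents",
    join_field: str = "relation",
) -> Tuple[str, Dict[str, Any]]:
    """Return (routing, document) for a mail event under one registration."""
    document = {
        "Template": template,
        "Actions": actions,
        join_field: {"name": child, "parent": registration_id},
    }
    # Routing must equal the parent ID so the child is stored on its shard.
    return registration_id, document


mapping = join_mapping()
routing, event = child_event("reg-1", "SolarWelcome", ["opened"])
```

One caveat worth weighing: a child has exactly one parent, while the events in the question relate to profiles that may own several registrations, so a profile-level parent may fit the data better than the registration-level parent sketched here.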
Another approach is to use Elasticsearch's aggregation feature. In this approach, you would perform a single search query on the Registrations index, filtered by the desired location. Then, you would use the ProfileId field to aggregate the data and retrieve the related MailEvents information. You can achieve this by using a nested aggregation, where you group by ProfileId and retrieve the relevant MailEvents data for each profile. This assumes the mail events are stored as a nested field (MailEvents) inside each registration document.
Here's an example query that performs this aggregation:
GET /registrations/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Location": "some_location"
          }
        }
      ]
    }
  },
  "aggs": {
    "profiles": {
      "terms": {
        "field": "ProfileId"
      },
      "aggs": {
        "mail_events": {
          "nested": {
            "path": "MailEvents"
          },
          "aggs": {
            "filtered_mail_events": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "prefix": {
                        "MailEvents.Template": "Solar"
                      }
                    },
                    {
                      "exists": {
                        "field": "MailEvents.Actions"
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "actions": {
                  "terms": {
                    "field": "MailEvents.Actions"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
This query will return the Registrations documents that match the desired location, and also provide aggregated information about the related MailEvents data. You can further manipulate the aggregated data to get the information that you need.
Note that this approach can be more complex than the parent-child relationship approach and may have performance implications if your data is large and complex. However, it may be a good solution if you need to perform complex aggregations on the MailEvents data.
As far as I know, Elasticsearch aggregations might be another way to do this. You can run a search across multiple indices, aggregate the list of ProfileIds from MailEvents, and use them to filter Registrations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
https://discuss.elastic.co/t/aggregation-across-multiple-indices/271350
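As a hedged sketch of that idea, a composite aggregation can page through every distinct ProfileId in the events indexes instead of relying on a fixed top-N. The function name, the aggregation names and the page size below are hypothetical; the Template prefix filter is taken from the example search in the question. Each response carries an after_key that is passed back in to request the next page.

```python
from typing import Any, Dict, Optional


def profile_ids_page(
    after_key: Optional[Dict[str, Any]] = None, page_size: int = 1000
) -> Dict[str, Any]:
    """Build one page of a composite aggregation over distinct ProfileIds."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"profile_id": {"terms": {"field": "ProfileId"}}}],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"prefix": {"Template": "Solar"}},
        "aggs": {"profiles": {"composite": composite}},
    }


first_page = profile_ids_page()
next_page = profile_ids_page(after_key={"profile_id": "p-1000"})
```

The collected IDs would then feed a terms filter on the registrations index, so this removes the row limit of the first request but not the terms limit of the second.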

Elasticsearch join-like query on multiple types and different fields

I have an Elasticsearch index called my_index which contains documents of two types, Type1 and Type2.
The two document types contain different data about the same type of entity.
The two document types both contain the ID of the related entity.
I've been trying to construct a join-like query which would return entities which match conditions on both document types, but I can't get it to work, and I also can't find any citation in the Elasticsearch multi-type or query documentation that says it's not possible.
The problem I'm trying to solve is avoiding having to manually join two result sets by getting all Type1 hits and all Type2 hits and doing the join outside of Elasticsearch, since the index has millions of documents.
The equivalent in SQL would be
select * from
Type1 inner join Type2
on Type2.EntityId = Type1.EntityId
where
Type1.Field = Condition AND
Type2.Field = Condition [...]
The URL I'm using to query against is http://elastic/my_index/Type1,Type2/_search to include both document types.
If I perform a blank query against this URL, I get hits of both Type1 and Type2.
If I add a criterion for Type1, it works as expected:
{ "query": {
"bool": {
"must": [{
"term": {
"FieldOnType1": "lorem" } } ] } } }
Somehow Elasticsearch can infer that FieldOnType1 is indeed a field on Type1.
When I add a criterion for Type2, I don't get any hits:
{ "query": {
"bool": {
"must": [{
"term": {
"FieldOnType1": "lorem" } }, {
"term": {
"FieldOnType2": "ipsum" } } ] } } }
In reality, there are sometimes more than 2 term queries, or range queries and term queries.
I'm guessing the problem with the above query is that no single document can match both criteria at once.
I've tried:
- using should instead of must
- qualifying the field names with type names
- many variations of the query (including using filters instead of queries)
but everything gives me 0 hits.
Similar questions here suggest to use the Elasticsearch multi-search API instead of the search API, but that won't solve my "manual join" problem.
Is there a way to make an elaborate "OR" query that would allow queries on both types? Or something else?
Try a multi_match query (I use ES 6, so I have one index per type):
GET index1,index2/_search
{
  "query": {
    "multi_match": {
      "query": "1",
      "fields": ["FieldOnType1", "FieldOnType2"]
    }
  }
}
If you need to match different values on different fields, a should clause should work:
GET test,test1/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {"firstName": "john"}
        },
        {
          "term": {"firstName1": "jerry1"}
        }
      ]
    }
  }
}
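Note that a should query returns documents that match either condition, so finding entities that match on both document types still requires grouping the hits by entity ID on the client. A minimal sketch of that grouping step follows, assuming each hit exposes the shared ID in _source under EntityId (the name used in the SQL above) and that the two types live in separate indices; the function name and the sample hits are made up for illustration.

```python
from typing import Any, Dict, Iterable, List, Set


def entities_matching_all(
    hits: Iterable[Dict[str, Any]],
    required_sources: Set[str],
    id_field: str = "EntityId",
    source_key: str = "_index",
) -> List[Any]:
    """Return entity IDs that have at least one hit in every required source."""
    sources_by_entity: Dict[Any, Set[str]] = {}
    for hit in hits:
        entity_id = hit["_source"][id_field]
        sources_by_entity.setdefault(entity_id, set()).add(hit[source_key])
    return [
        entity_id
        for entity_id, sources in sources_by_entity.items()
        if required_sources <= sources
    ]


# Hypothetical hits shaped like the "hits.hits" array of a search response.
sample_hits = [
    {"_index": "index1", "_source": {"EntityId": 1, "FieldOnType1": "lorem"}},
    {"_index": "index2", "_source": {"EntityId": 1, "FieldOnType2": "ipsum"}},
    {"_index": "index1", "_source": {"EntityId": 2, "FieldOnType1": "lorem"}},
]
matched = entities_matching_all(sample_hits, {"index1", "index2"})
```

This is still the manual join the question hoped to avoid; it only narrows the documents that have to be pulled out of Elasticsearch to those that match at least one condition.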

How to specify the execution order of filter and query in an Elasticsearch query

Consider the following query in Elasticsearch:
GET nyc_visionzero/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "fuzzy": {
            "on_street_name": "AVENUE"
          }
        }
      ],
      "filter": {
        "term": {
          "borough": "MANHATTAN"
        }
      }
    }
  }
}
Is the filter part executed first and then the fuzzy query, or is it the other way around? And if I want to change the order of their execution, how can I do that?
This question relates to the query vs. filter context topic. Everything in the query context (here query.bool.must) contributes to the score of a document, whereas the conditions in the filter context (here query.bool.filter) are a yes/no decision.
So from a performance perspective, filters are faster and can be cached. On the other hand, queries allow for some fuzziness.
There is a much more detailed explanation on this in the elasticsearch docs on query and filter context.
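To my knowledge the request body offers no switch for forcing one clause to run before the other; the engine decides how to combine them. What can be done is to inspect how a given request was executed by turning on the search profiler with "profile": true. The sketch below builds the same request as a Python dict with that flag added; the function name is hypothetical.

```python
from typing import Any, Dict


def fuzzy_with_filter(profile: bool = False) -> Dict[str, Any]:
    """Build the bool query from the question, optionally with profiling on."""
    body: Dict[str, Any] = {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"fuzzy": {"on_street_name": "AVENUE"}}],
                # Filter context: yes/no match, no scoring.
                "filter": [{"term": {"borough": "MANHATTAN"}}],
            }
        }
    }
    if profile:
        # Ask Elasticsearch to return per-clause timing details.
        body["profile"] = True
    return body


profiled = fuzzy_with_filter(profile=True)
```

The profile section of the response then lists the time spent in each query component, which is a more reliable guide than guessing at the execution order.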

Constructing a NEST/ElasticSearch query with nested properties

I'm querying an ElasticSearch database (the Danish CVR registry) using NEST in C#. I'm trying to formulate a query that will query this scheme:
relations: [
  {
    participant: {
      key: 123123
    },
    organisations: [
      {
        organisationName: {
          name: "some string",
          period: {
            from: "SOME DATE",
            to: "SOMEDATE OR NULL"
          }
        },
        ... more of similar objects ..
      }
    ]
  },
  .. more of similar objects ..
]
My problem is that I need to find documents that have a certain participant.key value and, at the same time, a specific organisations.organisationName.name and a missing or null value in organisations.organisationName.period.to.
I know I need a nested query to get documents that have both a null value in the to field and a certain name in the name field, but on top of that I also need the specific key in the participant.key field, and this is where I'm having trouble. Note that all 3 fields I'm checking must be within the same relations object, and the to and name fields must be within the same organisationName object.
The query without the key part as a JSON query is this:
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "relations.organisations.organisationName",
            "score_mode": "max",
            "query": {
              "bool": {
                "must": [
                  { "match": { "relations.organisations.organisationName.name": "EJERREGISTER" } },
                  {
                    "filtered": {
                      "filter": {
                        "missing": { "field": "relations.organisations.organisationName.period.to" }
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
Hoping someone out there is apt at making these queries in the NEST Query DSL. I could also work from a pure ElasticSearch JSON query, but the .NET equivalent would be my preferred option :)
Thanks in advance!
After some experimentation I came to the conclusion that the right answer to my problem would be a nested query that 1. checks the key, and 2. contains a further nested query that does the other checks I needed on the organisations.organisationName object.
I couldn't quite verify this, however, because the database I'm querying does not have the relations object marked as nested (and I can't change that since it's a government database).
My workaround was to retrieve all relations related to my keys and then filter out the remaining objects in memory, as this wasn't too much overhead in my scenario.
Edit: as a follow up, the external database I was using added the nested clause, and it worked as explained above.
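The nested-inside-nested shape described in this answer might look like the sketch below, written as a plain Python dict rather than NEST. It assumes that both relations and relations.organisations.organisationName are mapped as nested (the second path is the one used in the question's query), and it replaces the old filtered/missing pair with bool must_not plus exists, which is how a missing value is expressed on Elasticsearch 5.0 and later. The function name is hypothetical.

```python
from typing import Any, Dict


def relation_query(participant_key: int, organisation_name: str) -> Dict[str, Any]:
    """Match one relations object by key, name and an absent period.to."""
    name_path = "relations.organisations.organisationName"
    inner = {
        "nested": {
            "path": name_path,
            "query": {
                "bool": {
                    "must": [{"match": {name_path + ".name": organisation_name}}],
                    # "missing" became must_not + exists in newer versions.
                    "must_not": [{"exists": {"field": name_path + ".period.to"}}],
                }
            },
        }
    }
    outer = {
        "nested": {
            "path": "relations",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"relations.participant.key": participant_key}},
                        inner,  # evaluated within the same relations object
                    ]
                }
            },
        }
    }
    return {"query": outer}


query_body = relation_query(123123, "EJERREGISTER")
```

Because the inner nested clause sits inside the outer one, the key, the name and the absent to value all have to be found within the same relations entry.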

Return list of affected indices from Elasticsearch

I need to write a query that searches across all indices in Elasticsearch and returns a list of all indices where at least one document meets the query requirements.
For now I'm getting the top 2000 documents and deduplicating them by index name.
To search across all indices in Elasticsearch, you can use the _all option.
You can try something like the following to get the indices that have hits for the query (the filtered wrapper shown here was removed in Elasticsearch 5.0; on newer versions, place the query_string query directly under query):
POST _all/_search
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "your search criteria"
        }
      }
    }
  }
}
Most APIs that refer to an index parameter support execution across multiple indices, using simple test1,test2,test3 notation (or _all for all indices)
You can extract the index name from the result set which will be present under _index
sample result:
"hits": [
{
"_index": "index-name",
}
]
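Instead of fetching documents and deduplicating their _index values on the client, a terms aggregation on the _index metadata field can return the matching index names directly. The sketch below is a hedged illustration: the function names, the aggregation name and the bucket limit are hypothetical, and the sample response is made up to show the shape of the buckets.

```python
from typing import Any, Dict, List


def indices_with_hits_body(
    query: Dict[str, Any], max_indices: int = 1000
) -> Dict[str, Any]:
    """Build a search body that buckets matching documents by index name."""
    return {
        "size": 0,  # no documents needed, only the buckets
        "query": query,
        "aggs": {
            "matching_indices": {
                "terms": {"field": "_index", "size": max_indices}
            }
        },
    }


def index_names(response: Dict[str, Any]) -> List[str]:
    """Extract index names from the aggregation part of a search response."""
    buckets = response["aggregations"]["matching_indices"]["buckets"]
    return [bucket["key"] for bucket in buckets]


body = indices_with_hits_body({"query_string": {"query": "your search criteria"}})

# Hypothetical response fragment, trimmed to the part that is read here.
sample_response = {
    "aggregations": {
        "matching_indices": {
            "buckets": [
                {"key": "index-a", "doc_count": 12},
                {"key": "index-b", "doc_count": 1},
            ]
        }
    }
}
names = index_names(sample_response)
```

Each bucket also carries a doc_count, so the same response shows how many documents matched in each index.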
