ES7 - how to model 1-n parent-child relations - different ES types - elasticsearch

I am migrating an old ES instance to ES7.
We need 1-n parent-child relations.
We used to have multiple types in the same index and it was easy.
Some types were related to their parent via _parent.
But ES7 only allows single-type indices, which makes me think I should convert the old types to separate indices.
I read the docs, and they suggest using the join field for parent-child relations; however, that seems to apply only to documents belonging to a single index.
https://www.elastic.co/blog/removal-of-mapping-types-elasticsearch
So if I convert my previous types to separate indices, in my understanding join will not help.
So what is the right solution to model parent-child relation between different types (or should I say indices) in ES7?
Or maybe I should not model my data as separate types/indices in ES7. But in that case, how to solve this?
Thanks in advance

Yes, you are correct to use indices instead of types: ES deprecated mapping types in version 7, so we have to create multiple indices to handle this use case.
So now we have only two options:
Option 1: Denormalize the data and ingest documents accordingly.
Here again you can manage it in two ways:
The first way is to denormalize so that you can keep using the join field: split the 1-to-n child types into n indices, each holding a 1-to-1 parent-child type pair. Basically, you would have as many indices as you had parent-child relations in the earlier version, with the same parent in every index. Number of indices = number of parent-child relationships.
The second way is to denormalize completely, so that a single index holds, in each document, all the information about all children of all the types you had. In this case, number of indices = 1.
If your children have unique fields, I think the second way, with a single index, may perform well. But you have not mentioned how many documents you have, so you would probably need to find a balance. Another technique is to combine both ways.
Disadvantages in this case:
Management of the ingestion layer or jobs
Complexity in maintaining the structure of the index
Performance issues when using the join type (see the linked documentation)
Keep an eye on future ES versions in case they change the parent-child feature, although this need not be a concern for now.
Advantages:
Probably a simpler service layer, since it does not have to implement the application-side join of Option 2, discussed below
Easier to correlate with the use cases coming from the front-end application
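As a minimal sketch of the join-field variant of Option 1, the request bodies for one such index could be built like this. The field name `relation` and the type names `order` and `shipment` are made up for illustration; only the `join` field type and its `relations` / `name` / `parent` keys come from Elasticsearch.

```python
import json


def join_mapping(parent: str, child: str) -> dict:
    """Build a create-index body with one join field relating parent -> child."""
    return {
        "mappings": {
            "properties": {
                # "relation" is a hypothetical field name; "join" is the field type.
                "relation": {"type": "join", "relations": {parent: child}}
            }
        }
    }


def child_doc(child: str, parent_id: str, fields: dict) -> dict:
    """Build a child document body that points at its parent document."""
    doc = dict(fields)
    doc["relation"] = {"name": child, "parent": parent_id}
    return doc


# Example bodies for a hypothetical order -> shipment relation.
print(json.dumps(join_mapping("order", "shipment"), indent=2))
print(json.dumps(child_doc("shipment", "order-1", {"carrier": "acme"})))
```

Note that a child document has to be indexed with its parent's id as the routing value, so that both land on the same shard.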
Option 2: Manage the join at the application layer
Have a single parent index and multiple child indices, but manage the join at the application layer. If you have multiple 1-to-n mappings, then the number of indices would be n (parent = 1, child = n-1).
Disadvantages:
May or may not be easy to correlate with the use cases
You have to write separate join logic at the application layer. Not to mention that if you want to aggregate across parent and child, you would have to write several for loops with multiple individual aggregation queries.
Advantages:
Ease of maintaining jobs or ingestion layer
Management of indexes would be less painful
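A rough sketch of what the application-side join in Option 2 might look like, assuming each child document stores its parent's id in a field called `parent_id` and each parent has an `id` (both names are assumptions, not something from your mapping). One `terms` query fetches the children of a page of parents, and the grouping happens in memory:

```python
from collections import defaultdict


def children_query(parent_ids, parent_field="parent_id"):
    """Search body that fetches the children of all given parents in one request."""
    return {"query": {"terms": {parent_field: list(parent_ids)}}}


def join_hits(parents, children, parent_field="parent_id"):
    """Attach each child document to its parent, entirely in application memory."""
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child[parent_field]].append(child)
    # Parents without children get an empty list.
    return [{**p, "children": by_parent.get(p["id"], [])} for p in parents]


# Usage with already-fetched documents (the _source of each hit).
parents = [{"id": "p1", "name": "first"}, {"id": "p2", "name": "second"}]
children = [{"parent_id": "p1", "n": 1}, {"parent_id": "p1", "n": 2}]
joined = join_hits(parents, children)
```

This costs two round trips per page (parents, then children), which is the price paid for keeping ingestion simple.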
Alternatively you can mix and match both the above options, depending on what use-cases you'd have.
So you see, both have their pluses and minuses. If the ingestion layer is easy in one, it becomes cumbersome in the other; if the service layer is easier to maintain in one, it becomes difficult in the other.
The best way is to go ahead with some mock data, do some performance testing, and see which factors matter most to you: ease of querying, index maintenance, query or aggregation performance, ease of developing and managing both the ingestion jobs and the service layer, etc.
May not be exactly what you are looking for, but I just hope this helps!

Related

Elastic Search: One index with custom type to differentiate document schemas VS multiple index, one per document type?

I am not experienced in ES (my background is more in relational databases), and I am trying to add a search bar to my web application that searches its entire content (or whatever content I choose to index in ES).
The architecture is Jamstack, with a Gatsby application fetching content (sometimes at build time, sometimes at runtime) from a Strapi application (headless CMS). In the middle, I developed a microservice that writes the documents created in the Strapi application to ES. At the moment, there is only one index for all documents, regardless of type.
My problem is that as the application grows and different types of documents are created (sometimes very different from one another; for example, I can have a news article and a hospital), I am having a hard time querying correctly, because I have to define a lot of specific conditions in the query to cover all document types.
I see two solutions. One is to keep a single index and break the query down into several; when the user hits the search button, those queries are run and their results are merged before being presented. The other is to break the single index into several, one per document type. That leads to another question: is it possible to query multiple indices at once and specify index-specific fields in the query?
Which is the best approach? I hope I have made myself clear.
Thanks in advance.
Given your example, where one document type is news and another is hospital, it makes sense to create multiple indices (but you also need to say how many such types you have). There are pros and cons to both approaches, and once you know them, you can choose based on your use case.
Before I list the pros and cons, the answer to your other question is yes: you can query multiple indices in a single request using the multi-search API (a single search can also target several indices by listing their names).
Pros of having a single index
Less management overhead than with multiple indices (this is why I asked how many indices your application may have).
More performant search queries, as the data is in a single place.
Cons
You are indexing different types of documents, so you will need a complex filter to get the data you want.
Relevance will suffer, because the mix of documents affects the IDF component of the similarity algorithm (BM25).
Pros of having separate indices
Separating the data based on its properties gives more relevant results.
Your search queries will not be complex.
If you have a really large amount of data, it makes sense to split it to get an optimal shard size and better performance.
Cons
More management overhead.
If you need to search across all indices, you have to implement multi-search and wait for the results from every index, which might be costly.
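To illustrate the multi-search route with index-specific fields, here is a small sketch that builds the newline-delimited body for `_msearch`. The index names `news` and `hospitals` and the fields `title` and `name` are assumptions based on the example in the question, not a known schema.

```python
import json


def msearch_body(searches):
    """Serialize (index, query) pairs into the NDJSON body expected by _msearch."""
    lines = []
    for index, query in searches:
        lines.append(json.dumps({"index": index}))  # header line: which index to hit
        lines.append(json.dumps({"query": query}))  # body line: the search itself
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


# One sub-search per index, each using that index's own fields.
body = msearch_body([
    ("news", {"match": {"title": "cardiology"}}),
    ("hospitals", {"match": {"name": "cardiology"}}),
])
print(body)
```

The response contains one result set per sub-search, in the same order, so the application still has to decide how to merge or present them.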

Elastic Search - Joins best practises

I came across the following in the documentation:
In Elasticsearch the key to good performance is to de-normalize your data into documents
And also,
the restriction about, where both the child and parent documents must be on the same shard
Consider a scenario with a multilevel hierarchy (grandparent --> parent --> child), where some parents have more children than others, so the data might be skewed, with a few shards containing exponentially more data than the others.
What are the best practices for gaining more performance?
Is it a good idea to put all the hierarchy in a single document (rather than one document for each level)? The parent data might be redundant if there are more children, as the parent data needs to be copied to all the documents.
Yes, both statements you mentioned are correct. Let me answer both of your questions in the context of your use case.
Is it a good idea to put all the hierarchy in a single document (rather than one document for each level). The parent data might be redundant if there are more children as the parent data need to be copied to all the documents?
Answer: In general, if all the data is in a single document, searching will definitely be much faster. That is the whole reason for denormalizing data, as the first statement says: you don't have to spawn multiple worker threads and combine results from multiple documents/shards/nodes. Storage is also cheap: denormalizing does not save storage cost, but it saves computing cost, which is costlier than storage. In short, if you are worried about query performance, denormalizing your data will give it a major boost.
What are the best practices with respect to gain more performance?
Answer: If you still go ahead with the normalized approach, then, as mentioned, you should keep all the related docs on the same shard and implement custom routing to achieve that.
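A minimal sketch of such custom routing, assuming each document carries the id of its top-level ancestor in a hypothetical `root_id` key. It builds the NDJSON body for the `_bulk` API, where every document of one hierarchy gets the same `routing` value and therefore lands on the same shard:

```python
import json


def bulk_lines(index, docs):
    """Build a _bulk body that routes each doc by its top-level ancestor id."""
    lines = []
    for doc in docs:
        action = {
            "index": {
                "_index": index,
                "_id": doc["id"],
                # Same routing value for grandparent, parent and child.
                "routing": doc["root_id"],
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc["source"]))
    return "\n".join(lines) + "\n"


# Hypothetical three-level hierarchy rooted at "g1".
docs = [
    {"id": "g1", "root_id": "g1", "source": {"level": "grandparent"}},
    {"id": "p1", "root_id": "g1", "source": {"level": "parent"}},
    {"id": "c1", "root_id": "g1", "source": {"level": "child"}},
]
payload = bulk_lines("family", docs)
```

Keep in mind that routing by the root id is exactly what concentrates a large hierarchy on one shard, so it does not remove the skew described in the question.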

How important is it to use separate indices for percolator queries and their documents?

The ElasticSearch documentation on the Percolate query recommends using separate indices for the query and the document being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
At the bottom of the page here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that creating a separate index for the document is quite a bit of extra work to maintain, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering if the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!
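For reference, a sketch of the two-index layout under discussion. The field name `query` and the document field `message` are placeholders. The query index needs a `percolator` field plus mappings for every document field the stored queries refer to, which is where the "in sync" requirement comes from:

```python
def percolator_index_mapping(document_properties):
    """Create-index body for the index holding the registered queries."""
    # Fields referenced by stored queries must be mapped here as well.
    props = dict(document_properties)
    props["query"] = {"type": "percolator"}  # "query" is a hypothetical field name
    return {"mappings": {"properties": props}}


def percolate_request(document):
    """Search body that matches one in-flight document against stored queries."""
    return {"query": {"percolate": {"field": "query", "document": document}}}


doc_fields = {"message": {"type": "text"}}
queries_index = percolator_index_mapping(doc_fields)
request = percolate_request({"message": "disk is almost full"})
```

With this layout, the documents index can reuse `doc_fields` unchanged, so only the mapping has to be shared between the two indices.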

Is having empty fields bad for lucene index?

The ES documentation on mappings states the following:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems. In these cases, it’s much better to utilize two independent indices.
I'm wondering how strictly I should take this.
Say I have three types of documents, all sharing the same 60-70% of fields, with the rest being unique to each type.
Should I put each type in a separate index?
Or would a single index be fine as well, meaning there won't be much wasted storage or any noticeable performance hit on search or index operations?
Basically I'm looking for any information to either confirm or disprove the quote above.
If your types overlap 60-70%, then ES will be fine; that does not sound 'mutually exclusive' at all. Note that:
Things will improve in future versions of ES
If you don't need them, you can disable norms and doc_values, as recommended here
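A quick sketch of what disabling those looks like in a mapping. The field names are invented; `norms` and `doc_values` are the actual mapping parameters:

```python
def sparse_field_mapping(text_fields, keyword_fields):
    """Build a mapping that turns off norms / doc_values on the given fields."""
    props = {}
    for name in text_fields:
        # No norms: scoring ignores field length for this field.
        props[name] = {"type": "text", "norms": False}
    for name in keyword_fields:
        # No doc_values: the field can no longer be sorted or aggregated on.
        props[name] = {"type": "keyword", "doc_values": False}
    return {"mappings": {"properties": props}}


# Hypothetical fields that are unique to one document type.
mapping = sparse_field_mapping(["internal_notes"], ["legacy_code"])
```

Only do this on fields where you really don't need scoring by length, sorting, or aggregations.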

Elasticsearch - equivalent of LEFT JOIN

I have 20,000,000 line items in Elasticsearch that I am happily searching (it's working amazingly well).
There is an added dimension though that I don't know how to solve:
A user can "buy" those items (in batches of 1,000 to 100,000) and I need my search to only return the items that they have not previously "bought". I'd solve this with a LEFT JOIN in SQL.
I could add a boughtBy[] field to each item, but then I would need to update lots of documents every time a user buys. Feels kind of wrong?
Elasticsearch is built on Lucene, which supports block joins (exposed in Elasticsearch as nested documents); for separately indexed documents, Elasticsearch offers Parent-Child Relationships. That gives you a join, but it also comes with limitations (it's no longer possible to arbitrarily distribute documents across nodes, and memory requirements can explode in certain scenarios).
Elasticsearch documentation gives you a nice overview of the relationship modelling options.
If you need deep joins, more complex relationships, etc., you might consider looking into the SIREn plugin.
(disclaimer: I currently work for the company that develops SIREn)
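To make the parent-child idea concrete, here is a sketch of the "not bought by this user" search, assuming each purchase is indexed as a child document of the item with a `user_id` field. The relation name `purchase` and the fields `title` and `user_id` are assumptions. A purchase then adds one small child document instead of updating the item:

```python
def unbought_items_query(user_id, text):
    """Search body: items matching `text` that have no purchase child for user_id."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                # Exclude items that have at least one purchase by this user.
                "must_not": [
                    {
                        "has_child": {
                            "type": "purchase",
                            "query": {"term": {"user_id": user_id}},
                        }
                    }
                ],
            }
        }
    }


query = unbought_items_query("user-42", "copper wire")
```

This requires items and purchases to live in the same index with a join field, and each purchase to be routed to its item's shard.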
