Performance-wise, should we favor nested documents or joining over multiple collections - performance

Performance-wise, which is the better option: nested documents ("_childDocuments_") or joins over multiple collections?

Related

How does Elasticsearch/Lucene achieve such performance when querying multiple fields?

According to the answer given here, Elasticsearch doesn't seem to use compound indexes for querying multiple fields; instead it queries multiple indexes and then intersects the results.
My question is: how does it achieve such high performance? Surely a composite index is faster, since it leads you straight to the desired data, rather than querying multiple indexes, each of which returns more data, and then comparing the results?
I get the advantages of multiple indexes regarding field order, etc., but in terms of performance, surely it's inferior...
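To make the "query multiple indexes, then intersect" step concrete, here is a minimal sketch of intersecting sorted lists of document IDs (postings lists). It is an illustration only, not Lucene's actual code: the function names are made up for this example, and Lucene's real iterators can additionally skip ahead within a postings list instead of visiting every entry.

```python
def intersect(postings_a, postings_b):
    """Intersect two sorted lists of document IDs with a two-pointer merge."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result


def intersect_all(postings_lists):
    """Intersect several postings lists, starting with the shortest one."""
    if not postings_lists:
        return []
    # Starting from the shortest list keeps the candidate set small.
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result
```

Because each list is already sorted by document ID, one pass over the lists is enough, and the work is bounded by the lists' lengths rather than by the size of the whole index.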

composite aggregation vs nested terms aggregation

Hi, I am currently using nested terms aggregations (three or more levels deep) to query Elasticsearch. I would rather use the composite aggregation with 3+ source fields, which I just discovered, since it is much more manageable in my opinion, but I was wondering whether this is a bad choice performance-wise. Any recommendations?
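For comparison, here is a rough sketch of the two request bodies being discussed, built as plain Python dictionaries. The helper names, the "by_" prefix, and the "buckets" aggregation name are assumptions made for this example; it says nothing about which form is faster, it only shows how the shapes differ.

```python
def nested_terms(fields, size=10):
    """Build a terms aggregation nested one level per field."""
    body = {}
    # Build from the innermost field outwards.
    for field in reversed(fields):
        level = {"terms": {"field": field, "size": size}}
        if body:
            level["aggs"] = body
        body = {"by_" + field: level}
    return {"size": 0, "aggs": body}


def composite(fields, page_size=1000, after_key=None):
    """Build a composite aggregation with one terms source per field."""
    sources = [{field: {"terms": {"field": field}}} for field in fields]
    comp = {"size": page_size, "sources": sources}
    if after_key is not None:
        # Pass the after_key from the previous response to fetch the next page.
        comp["after"] = after_key
    return {"size": 0, "aggs": {"buckets": {"composite": comp}}}
```

One difference to keep in mind when comparing them: nested terms aggregations return the top buckets at each level, whereas the composite aggregation is meant for paging through all combinations, one page per request.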

ES7 - how to model 1-n parent-child relations - different ES types

I am migrating an old ES instance to ES7.
We need 1-n parent-child relations.
We used to have multiple types in the same index and it was easy.
Some types were related to their parent via _parent.
But ES7 will only allow single-type indices.
Which makes me think I will convert the old types to separate indices.
I read the docs and they suggest using join for parent-child relations, however those seem to apply only to documents belonging to a single index.
https://www.elastic.co/blog/removal-of-mapping-types-elasticsearch
So if I convert my previous types to separate indices, in my understanding join will not help.
So what is the right solution to model parent-child relation between different types (or should I say indices) in ES7?
Or maybe I should not model my data as separate types/indices in ES7. But in that case, how to solve this?
Thanks in advance
Yes, using separate indices instead of types is the right approach: mapping types are deprecated in ES 7, so we have to create multiple indices to handle this use case.
So now we have only two options:
Option 1: Denormalize the data and ingest documents accordingly.
Here again you can manage it in two ways:
The first way is to denormalize so that you can keep using the join field: split the 1-to-n child types into n indices, each holding a single parent-child relation. You would have as many indices as you had parent-child relations in the earlier version, with the same parent in every index. Number of indices = number of parent-child relationships.
The second way is to denormalize completely, so that a single index holds, in each document, all the information of all children from all the types you had. In this case, number of indices = 1.
If your children have unique fields, I think the second way, with a single index, may perform well, but you have not mentioned how many documents you have, so you would probably need to find a balance. Another technique is to combine both.
Disadvantages in this case would be:
Management of the ingestion layer or jobs
Complexity in maintaining the structure of the index
Performance issues when using the join type, as per this link
You need to keep an eye on future ES versions in case they modify the parent-child feature, although this need not be a concern for now.
Advantages:
Probably a simpler service layer, which doesn't have to deal with the join logic of Option 2 discussed below
Easier to correlate with the use cases you may have from the front-end application.
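As an illustration of the join-field variant of Option 1, here is a minimal sketch of what the mapping and a child document might look like, built as plain Python dictionaries. The field name "relation", the index and relation names, and the helper functions are all hypothetical; only the "join" field type, its "relations" setting, and the need to route a child to its parent's shard come from Elasticsearch itself.

```python
JOIN_FIELD = "relation"  # hypothetical name for the join field


def build_mapping(parent, children):
    """Mapping with one join field relating a parent to its child relation names."""
    return {
        "mappings": {
            "properties": {
                JOIN_FIELD: {"type": "join", "relations": {parent: children}}
            }
        }
    }


def parent_doc(parent, source):
    """Document body for a parent: the join field only names the relation."""
    return {**source, JOIN_FIELD: {"name": parent}}


def child_action(index, child, doc_id, parent_id, source):
    """Index request for a child; routing must be the parent's ID so that
    parent and child end up on the same shard."""
    return {
        "index": index,
        "id": doc_id,
        "routing": parent_id,
        "body": {**source, JOIN_FIELD: {"name": child, "parent": parent_id}},
    }
```

Note that "relations" accepts a list of child names for one parent, so whether one index per relation is really needed depends on how different the child fields are.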
Option 2: Manage the join at the application layer
Have a single parent index and multiple child indices, but manage the join at the application layer. If you have multiple 1-to-n mappings, the number of indices would be n (parent = 1, children = n-1).
Disadvantages:
May or may not be easy to correlate with your use cases
You have to write separate join logic at the application layer. And if you want to aggregate across parent and child, you'd have to write several for loops with multiple individual aggregation queries.
Advantages:
Ease of maintaining jobs or ingestion layer
Management of indexes would be less painful
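A minimal sketch of what the application-layer join in Option 2 could look like is shown here. It assumes each child document stores its parent's ID in a field called "parent_id" and each parent hit carries an "id" key; both names, and the helper functions, are assumptions for this example. The search calls themselves are left out, so the functions work on lists of already-fetched documents.

```python
from collections import defaultdict


def child_query(parent_ids, parent_id_field="parent_id"):
    """Request body selecting the children of the given parents (terms query)."""
    return {"query": {"terms": {parent_id_field: list(parent_ids)}}}


def join_parents_children(parents, children, parent_id_field="parent_id"):
    """Attach each child to its parent, in memory."""
    by_parent = defaultdict(list)
    for child in children:
        by_parent[child[parent_id_field]].append(child)
    # Parents without children get an empty list.
    return [{**p, "children": by_parent.get(p["id"], [])} for p in parents]
```

The flow would be: search the parent index, collect the parent IDs, run child_query against each child index, then merge the hits with join_parents_children. Each extra child index adds another round trip, which is the cost this option trades for simpler ingestion.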
Alternatively, you can mix and match both of the above options, depending on your use cases.
So you see, both have their pluses and minuses. If the ingestion layer is easy in one, it becomes cumbersome in the other; if the service layer is easier to maintain in one, it becomes difficult in the other.
The best way is to go ahead with some mock data, do some performance testing, and see which factors matter to you: ease of querying, maintenance of the index, query or aggregation performance, ease of developing and managing both the ingestion jobs and the service layer, etc.
May not be exactly what you are looking for, but I just hope this helps!

How important is it to use separate indices for percolator queries and their documents?

The Elasticsearch documentation on the Percolate query recommends using separate indices for the query and the document being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
At the bottom of the page here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that creating a separate index for the document is quite a bit of extra work to maintain, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering if the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!
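If the "in sync" concern is mainly about mappings, one possible way to reduce the maintenance work is to derive the query index's mapping from the document index's field mappings, since the fields referenced by stored queries must be mapped in the index that holds them. The sketch below builds the request bodies as plain Python dictionaries; the helper names and the "query" field name are assumptions for this example, while the "percolator" field type and the "percolate" query with an inline "document" come from Elasticsearch.

```python
def query_index_mapping(document_properties, query_field="query"):
    """Mapping for the query index: the document fields plus a percolator field."""
    if query_field in document_properties:
        raise ValueError("document mapping already has a field named " + query_field)
    properties = dict(document_properties)  # copy, so the input is not modified
    properties[query_field] = {"type": "percolator"}
    return {"mappings": {"properties": properties}}


def percolate_request(document, query_field="query"):
    """Search body that matches one in-flight document against stored queries."""
    return {"query": {"percolate": {"field": query_field, "document": document}}}
```

With this shape, the document is passed inline in the search request, so the query index can be searched before (or regardless of whether) the document is stored in its own index.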

Does solr support the sorting while creating index?

In my test environment, there are nearly 130,000,000 documents on each server. Searching is fast without sorting by date, but extremely slow when sorting is enabled.
I think searching would be more efficient if Solr could sort an indexed field while creating the index. So, how do I configure Solr to sort some fields while indexing?
The initial sorted query will be slower, but all subsequent queries should be fast.
Solr sorts using per-field caches (the field cache, or docValues if they are enabled on the field), which are populated by the first query that sorts on that field.
You can also warm the sort fields.
Also check whether the overhead really comes from sorting alone, and not from querying and scoring.
