Elasticsearch - Joins best practices - elasticsearch

I came across the following as part of the documentation:
In Elasticsearch the key to good performance is to de-normalize your data into documents
And also,
the restriction that both the child and parent documents must be on the same shard
Given a scenario with a multilevel hierarchy (grandparent --> parent --> child), where some parents have more children than others, the data might be skewed and a few shards could contain exponentially more data than the others.
What are the best practices for getting better performance?
Is it a good idea to put the whole hierarchy in a single document (rather than one document per level)? The parent data might be redundant when there are many children, as it needs to be copied into all of them.

Yes, both statements you mentioned are correct. Let me answer both of your questions in the context of your use case.
Is it a good idea to put the whole hierarchy in a single document (rather than one document per level)? The parent data might be redundant when there are many children, as it needs to be copied into all of them.
Answer: In general, if you have all the data in a single document, searching will definitely be much faster; that is the whole point of denormalizing data in databases, which the first statement also mentions, since you don't have to spawn multiple worker threads and combine results from multiple documents/shards/nodes. Also, storage is cheap: denormalization increases storage cost, but it saves computing cost, which is the more expensive of the two. In short, if you are worried about query performance, denormalizing your data will give it a major boost.
What are the best practices for getting better performance?
Answer: If you still go ahead with the normalized approach, then, as mentioned, you should keep all related documents on the same shard, and you should implement custom routing to achieve that.
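As a minimal sketch of the denormalized layout described above: one document per child, with the parent and grandparent fields copied into it. All field and value names here are illustrative, not from the question.

```python
# One document per child; the parent/grandparent data is duplicated into it.
# Field names ("grandparent_name", "child_revenue", ...) are made up.
denormalized_doc = {
    "grandparent_name": "Acme Holdings",
    "parent_name": "Acme Retail",
    "child_name": "Store #42",
    "child_revenue": 120000,
}

# A query over this layout needs no join at all: one bool filter does it.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"parent_name": "Acme Retail"}},
                {"range": {"child_revenue": {"gte": 100000}}},
            ]
        }
    }
}
```

The price, as the question notes, is that the parent fields are stored once per child rather than once overall.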
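The custom-routing approach can be sketched as follows, using a join-field mapping for the two parent/child relations and routing every document in one family by the same value. Index, field, and id names here ("company", "my_join_field", "gp1") are assumptions for illustration.

```python
# A join-field mapping modelling grandparent -> parent -> child.
mapping = {
    "mappings": {
        "properties": {
            "my_join_field": {
                "type": "join",
                "relations": {
                    "grandparent": "parent",
                    "parent": "child",
                },
            }
        }
    }
}

# Every document in one family is indexed with the same routing value
# (here the grandparent's id), so the whole tree lands on one shard.
def index_request(doc_id, body, family_id):
    """Build the parameters for an index call that keeps a family together."""
    return {
        "index": "company",    # hypothetical index name
        "id": doc_id,
        "routing": family_id,  # same value for the whole hierarchy
        "document": body,
    }

child_doc = {
    "name": "some child",
    "my_join_field": {"name": "child", "parent": "p1"},
}
req = index_request("c1", child_doc, family_id="gp1")
```

Note that with skewed families, routing everything by the grandparent can concentrate the largest trees on a few shards, which is exactly the hot-shard risk the question raises.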

Related

Optimizing an Elasticsearch index for many updates on few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing the entire documents, even though a simple boolean is changing.
Reading around I found that this might be a case where one could store the boolean value within a parent-child field, as this would effectively place it in a separate index thus not forcing a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?
Yes. Since Elasticsearch is internally an append-only system, every update effectively creates a new document and marks the old copy as stale, to be garbage-collected later.
Parent-child (a.k.a. join) or nested fields can help with this, but they come with a significant cost of their own, so your search performance will probably suffer.
Another approach is to use external fast storage like Redis (as described in this article), though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here: all use cases are different, and you should carefully benchmark all feasible options.
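One mitigation worth noting: even though Elasticsearch re-indexes the whole document internally, the Update API at least avoids shipping the full document from the client on every flip of the flag. A sketch, with a made-up index name and field:

```python
# Partial update of the frequently-changing boolean via the Update API.
# Elasticsearch still rewrites the document internally, but the client
# only sends the changed field. "items" and "is_active" are illustrative.
update_body = {
    "doc": {"is_active": True},
}

# Parameters for a POST /items/_update/<id> call:
update_request = {
    "index": "items",
    "id": "doc-123",
    "body": update_body,
}
```

This reduces network and serialization cost but not the segment-rewrite cost, which is why the parent-child and external-storage options above exist at all.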

Elasticsearch: one index with a custom type to differentiate document schemas vs. multiple indices, one per document type?

I am not experienced in ES (my background is more in relational databases), and I am trying to add a search bar to my web application that searches its entire content (or at least the content I am willing to index in ES).
The architecture is Jamstack: a Gatsby application fetching content (sometimes at build time, sometimes at runtime) from a Strapi application (headless CMS). In the middle, I developed a microservice that writes the documents created in the Strapi application to the ES database. At the moment there is only one index for all the documents, regardless of type.
My problem is that as the application grows and different types of documents are created (sometimes very different from one another; for example, I can have an article (news) and a hospital), I am having a hard time querying the database correctly, as I have to define a lot of type-specific conditions in the query (to cover all types of documents).
My options are either to keep a single index and break the query down into several, running them all when the user hits the search button and joining the results before presenting them, OR to break the single index into several, one per document type, which leads me to another doubt: is it possible to query multiple indexes at once and target index-specific fields in the query?
Which is the best approach? I hope I have made myself clear.
Thanks in advance.
Given the example you provided, where one type of document is news and another is hospital, it makes sense to create multiple indices (but you would also need to say how many such different types you have). There are pros and cons to both approaches, and once you know them, you can choose one based on your use case.
Before listing the pros and cons, the answer to your other question: yes, you can query multiple indices in a single search request using the multi-search API.
Pros of having a single index
Less management overhead than multiple indices (this is why I asked how many such indices your application may end up with).
More performant search queries, as the data is present in a single place.
Cons
You are indexing different types of documents, so you will have to include complex filters to get the data you need.
Relevance will not be as good, as the mix of document types skews the IDF of the similarity algorithm (BM25), which hurts relevance scoring.
Pros of having different indices
It is better to separate the data based on its properties, for more relevant results.
Your search queries will not be complex.
If you have really huge data, it makes sense to break it up to achieve the optimal shard size and better performance.
Cons
More management overhead.
If you need to search across all indices, you have to use multi-search and wait for all the results, which might be costly.
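The multi-search call mentioned above can be sketched as follows. On the wire, `_msearch` takes newline-delimited JSON with alternating header and body lines; the index names and fields here ("news", "hospitals") are illustrative.

```python
import json

# Build an _msearch payload that queries two per-type indices at once,
# each with its own index-specific fields.
searches = [
    ({"index": "news"},      {"query": {"match": {"title": "flu"}}}),
    ({"index": "hospitals"}, {"query": {"match": {"name": "flu clinic"}}}),
]

# The multi-search body is NDJSON: header, body, header, body, trailing newline.
msearch_body = "\n".join(
    json.dumps(part) for pair in searches for part in pair
) + "\n"
```

The response contains one result set per search, so the application still has to merge and rank the per-index results itself, which is the "costly" part flagged in the cons.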

How important is it to use separate indices for percolator queries and their documents?

The Elasticsearch documentation on the percolate query recommends using separate indices for the queries and the documents being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
At the bottom of the page here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that creating a separate index for the document is quite a bit of extra work to maintain, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering if the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!
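For concreteness, the two-index setup the docs recommend looks roughly like this: a dedicated index whose mapping has a percolator field (plus mappings for every field the registered queries reference), and percolation runs against that index with the candidate document passed inline. All names here ("queries", "message") are illustrative.

```python
# Mapping for the dedicated percolator-queries index.
queries_index_mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            # Fields referenced by registered queries must be mapped here too.
            "message": {"type": "text"},
        }
    }
}

# Registering one query is just indexing a document into that index:
registered_query = {"query": {"match": {"message": "breaking news"}}}

# Percolating a candidate document against the registered queries:
percolate_search = {
    "query": {
        "percolate": {
            "field": "query",
            "document": {"message": "some breaking news article"},
        }
    }
}
```

Because the document is passed inline at percolation time, the "sync" burden is mainly keeping the field mappings of the two indices compatible, rather than keeping document copies transactionally consistent.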

Best way to store votes in elasticsearch for a reddit like system

I am building a site similar to Reddit using Elasticsearch and trying to decide the best place to store the up/down votes. I can think of a couple of options.
Store them as part of the document.
In this case, any vote triggers an update of the document. According to the Elasticsearch documentation, this is essentially a replacement of the whole document, which seems to be a very expensive operation.
Store them in another database.
Store votes in another database like SQL/MongoDB and update Elasticsearch periodically. In this case, we have to tolerate some delay before new votes affect search results, which is not ideal, and it also increases complexity and maintenance cost.
Store them in another index in Elasticsearch.
This separates the concerns by index: one mostly read-only, one read-write. Is there an efficient way to merge the two indices so that I can order by votes at query time?
Any suggestions on those options, or a better way to handle this?
There is a fourth option: store votes in a separate document with a different type but in the same index as the original document. The votes type can be made a child of the article type. This setup enables you to query articles and votes at the same time using has_child filters and queries. It also requires reindexing only a small vote document every time a vote occurs, instead of the large article document. On the negative side, the has_child and has_parent queries require loading the parent/child map into memory, so this approach has a non-trivial memory footprint compared to the other options you described.
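The votes-as-child-documents idea can be sketched as follows, expressed with the join field (the answer above predates the removal of mapping types, so this is a modern-Elasticsearch translation, with illustrative names).

```python
# Join-field mapping relating articles to their vote documents.
mapping = {
    "mappings": {
        "properties": {
            "article_vote": {
                "type": "join",
                "relations": {"article": "vote"},
            }
        }
    }
}

# A vote is a tiny child document; only it is re-indexed on each vote.
vote_doc = {
    "value": 1,  # +1 for an up-vote, -1 for a down-vote
    "article_vote": {"name": "vote", "parent": "article-7"},
}

# Finding articles that have at least one up-vote, via has_child:
query = {
    "query": {
        "has_child": {
            "type": "vote",
            "query": {"term": {"value": 1}},
        }
    }
}
```

Vote documents must be routed by their parent article's id so parent and children share a shard, which is the same constraint discussed in the joins question above.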

elasticsearch - tips on how to organize my data

I'm trying out Elasticsearch by pulling in some data from Facebook and Twitter.
The question is: how should I organize this data into indices?
/objects/posts
/objects/twits
or
/posts/post
/twits/twit
I'm trying queries such as: get posts where author_id = X.
You need to think about the long term when deciding how to structure your data in Elasticsearch. How much data are you planning to capture? Will search requests look at both the Facebook and the Twitter data? Consider request volume, query types, and so on.
Personally, I would start with the first approach, localhost:9200/social/twitter,facebook/, as this avoids creating another index when it isn't necessarily required. You can easily search across both types, which has less overhead than searching across two indexes. There is quite an interesting article here about how to grow with intelligence.
Elasticsearch has many configuration options; essentially, it's about finding the balance that fits your data.
The first one is the better approach, because creating two indices creates two Lucene instances, which will affect response time.
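Since mapping types have been removed in modern Elasticsearch, the single-index approach recommended above is usually expressed with a discriminator field instead. A sketch, with illustrative field names and values:

```python
# One index for all social documents, with a keyword field marking the source.
doc_post = {"source": "facebook", "author_id": "X", "text": "hello"}
doc_twit = {"source": "twitter", "author_id": "X", "text": "hi"}

# "Get posts by author_id = X", restricted to Facebook documents:
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"author_id": "X"}},
                {"term": {"source": "facebook"}},
            ]
        }
    }
}
```

Dropping the `source` filter searches both kinds of documents at once, which preserves the "search across both" convenience the answer describes.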
