Advice on efficient ElasticSearch document design

Advice on efficient ElasticSearch document design - elasticsearch

I'm working on a project that deals with listings (think: Craiglist, Ebay, Trulia, etc).
The basic unit of information is a "Listing", something like this:
{
"id": 1,
"title": "Awesome apartment!",
"price": 1000000,
// other stuff
}
Some fields can be searched upon (e.g price, location, etc), others are just for display purposes on the application (e.g title, description which contains lots of HTML etc).
My question is: should i store all the data in one document, or split it into two (one for searching e.g 'ListingSearchIndex', one for display, e.g 'ListingIndex').
I also have to do some pretty hefty aggregations across the documents too.
I guess the question is, would searching across smaller documents then doing another call to fetch the results by id be faster than just searching across the full document?
The main factors is obviously speed, but if i split the documents then maintenance would be a factor too.
Any suggestions on best practices?
Thanks :)

In my experience with Elasticsearch, shard configuration has been significant in cluster performance/ speed when querying, aggregating etc. Since, every shard by itself consumes cluster resources (memory/cpu) and has a cost towards cluster overhead it is ideal to get the shard count right so the cluster is not overloaded. Our cluster was over-sharded and it impacted loading search results, visualizations, heavy aggregations etc. Once we fixed our shard count it worked flawlessly!
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.

Besides performance, I think there's other aspects to consider here.
ElasticSearch offers weaker guarantees in terms of correctness and robustness than other databases (on this topic see their blog post ElasticSearch as a NoSQL database). Its focus is on search, and search performance.
For those reasons, as they mention in the blog post above:
Elasticsearch is commonly used in addition to another database
One way to go about following that pattern:
Store your data in a primary database (e.g. a relational DB)
Index only what you need for your search and aggregations, and to link search results back to items in your primary DB
Get what you need from the primary DB before displaying - i.e. the data for display should mostly come from the primary DB.
The gist of this approach is to not treat ElasticSearch as a source of truth; and instead have another source of truth that you index data from.
Another advantage of doing things that way is that you can easily reindex from your primary DB when you change your index mapping for a new search use case (or on changing index-time processing like analyzers etc...).

I think you can't answer this question without knowing all your queries in advance. For example consider that you split to documents and later you decide that you need to filter based on a field stored in one index and sort by a field that is stored in another index. This will be a big problem!
So my advice to you, If you are not sure where you are heading, just put everything in one index. You can later reindex and remodel.

Related

What's the difference bettween Fielddata enabled vs eager global ordinals for optimising initial search query latency in Elasticsearch

I have an elasticsearch (7.10) cluster running that is primarily meant for powering search on text documents. The index that I'm working with does not need to be updated often, and there is no great necessity for speed during index time. Performance in this system is really needed for search time. The number of documents will likely always be in the range of 50-70 million and the store size is ~300GB once it's all built.
The mapping for the index and field I'm concerned with looks something like this:
"mappings": {
"properties": {
"document_text": {
"type": "text"
}
}
}
The document_text is a string of text anywhere in the region of 50-500 words. The typical queries being sent to this index are match queries chained together inside a boolean should query. Usually, the number of clauses are in the range of 5-15.
The issue I've been running into is that the initial latency for search queries to the index is very high usually in the range of 4-6s but after the first search the document is cached so the latency becomes much lower <1s. The cluster has 3 data nodes, 3 master nodes and 2 ingest/client nodes and is backed by fast SSD. I noticed that the heap on the data nodes is never really under too much pressure nor is the RAM this led me to realize that the documents weren't cached in advance the way I wanted them to be. From what I've researched I've landed on either enabling fielddata=true to get the field data object in memory at index time rather than constructing that at search time. I understand this will increase pressure on the JVM heap so I may do some frequency filtering to only place certain documents in memory. The other option I've come across is setting eager_global_ordinals=true which in some ways seems similar to enabling fielddata as it builds the mappings in-memory at index time also. I'm a bit new with ES and the terminology between the two is somewhat confusing to me. What I'd love to know is what is the difference between the two and does enable one or both of them to seem reasonable to solve the latency issues I'm having or I have completely misunderstood the docs. Thanks!

Enabling eager_global_ordinals won't do any kind of effect on your queries. Enabling global ordinals would only help for aggregations, doc values would be loaded at index refresh time instead of loading them at query time.
Enabling fielddata would also not do any real effect on your queries. Its primary purpose is sorting and aggregation, which you don't really want to do on a text field.
There's probably not much you can do with first ES queries being slower. Better focus on optimal index mappings, settings, shards, and document sizes.

Elasticsearch index policy creation best practice/performance

I am designing a search system based on ElasticSearch, after reading a lot I have seen that some systems such as logs use a policy of multiple indexes to save the same content, similar to mylogs-12-02-2020 and are creating an index by day, then to search, they perform the searches in all the indices that comply with the mylogs- * pattern, each of those indices has its primary shards and replicas.
My question would be regarding the performance of the searches, which would be more performant to look at an index of 5 million documents, with n shards or look for 50 indexes of 100,000 documents. Does anyone have any experience with the best practice to follow?
I am assuming that my system will have an approximate growth of 200,000 documents per day.
What is the best practice, separate in multiple indexes or have a single index with several primary shards in different nodes (so that they do not compete for the same resources when searching / indexing)?
When doing a search on mylogs-* elastic does it parallel to the indexes and within each index in its shards?

Elasticsearch default configuration given by #Umar is old and starting with 7.0 ES latest major version, Primary shards reduced to 1, you can check this in ES official breaking changes announcement.
Nobody can design the perfect ES index with optimal no of shards and replicas and required continuous fine-tuning over the period. Some factors which affect the design consideration.
Read or Write-heavy system.
Time-based indices(like your log searches) where normally searches happen on more recent logs or e-commerce product catalog or website search where you can't divide indices into time-based data.
ES cluster(multi-tenant vs dedicated to single index).
Above are just a few samples and I can go can give 100s of other factors, which you can consider while designing your ES index configuration. But the idea is to start with more crucial params first(like changing primary shards requires re-indexing) also consider the near-future growth and fine-tune later on based on current system performance.
I would strongly suggest you go through my detailed blog which would answer your questions about(searching in one index with more docs than searching in more indices/shards with fewer docs) in detail through a real-world case study.
The above blog also explains the ES decision to change the longtime default primary shards from 5 to 1.
Answer to your below question:
Question: When doing a search on mylogs-* elastic does it parallel to the indexes and within each index in its shards?
Answer: Yes, ES has distributed architecture and as ES index is made of Lucene shard which is a full-blown search engine, Every ES query would be executed by multiple threads in parallel if it needs to hit multiple shards(whether of same index or multiple indices), Given threads are free, otherwise once a thread finish, it would be then be used to query another shard. this is why ES is much faster like other distributed systems.

By default, an Elasticsearch index has 5 primary shards and 1 replica for each. But the problem is default configurations are not suitable for every use case.
Shard size is quite critical for search queries. If there would be too many shards that are assigned to an index, Lucene segments would be small which causes an increase in overhead. Lots of small shards would also reduce query throughput when multiple queries are made simultaneously. On the other hand, too large shards cause a decrease in search performance and longer recovery time from failure. Therefore, it is suggested by Elasticsearch that one shard’s size should be around 20 to 40 GB.
Keep in mind it is the shard that acts as a separate search engine in itself, not the index. indices are a type of data organization mechanism, allowing the user to partition data a certain way. that is all!
For further details read this article.

Elasticsearch - implications of splitting documents into separate indexes

Let's say I have 100,000 documents from different customer groups, which are formatted the same with the same type of information.
Documents from individual customer groups get refreshed at different times of the day. I've been recommended to give each customer group their own index so when my individual customer index is refreshed locally I can create a new index for that customer and delete the old index for that customer.
What are the implications for splitting the data into multiple indexes and querying using an alias? Specifically:
Will it increase my server HDD requirements?
Will it increase my server RAM requirements?
Will elasticsearch be slower to search by querying the alias to query all the indexes?
Thank you for any help or advice.

Every index has some overhead on all levels but it's usually small. For 100,000 documents I would question the need for splitting unless these documents are very large. In general each added index will:
Require some amount of RAM for insert buffers and other per-index related tasks
Have it's own merge overhead on disk relative to a larger single index
Provide some latency increase at query time due to result merging if a query spans multiple indexes
There are a lot of factors that go into determining if any of these are significant. If you have lots of RAM and several CPUs and SSDs then you may be fine.
I would advise you to build a solution that uses the minimum number of shards as possible. That probably means one (or at least only a few) index(es).

Performance issues using Elasticsearch as a time window storage

We are using elastic search almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in the ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in the elastic search for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute for each machine and searching using queries like the one above pratically continuously.
We are having a lot of trouble indexing and retrieving these documents, we are not getting a good throughput volume of documents being indexed and returned by ES. We can't get even 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on out part. As I told before, we used Elasticsearch as a sort of a cache, to look for documents inside a 30 minutes time window. We looked for documents in elasticsearch whose content matched some query, and whose insert date was within some range. Elastic would then return us the full document json (which had a whole lot of data, besides the indexed content). Our configuration had elastic indexing the document json field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.

TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some setting that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY, note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background, then back to its default after the bulk operation restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts and the default value is 10.

Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. Its URL in your EDIT was not hyperlink - it's at http://sematext.com/spm . "Indexing" graphs will show how changing of the merge-related settings affects performance.

I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.

You would benefit greatly to store those fields in Solr compared to the 40 db roundtrips. Just make sure that you marked the field as "not indexed" (indexed = false) in your schema config and maybe also compressed (compressed = true) (however this will of course use some CPU when indexing and retrieving).
When marking a field as "not indexed" no analyzers will process the field when indexing making it stored much faster than a indexed field.

It's a trade off, and you will have to analyze this yourself.
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two worth of live queries and run them against different indexes and configurations to see how well they work.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio