ElasticSearch Scale Forever - elasticsearch

ElasticSearch Community:
Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.
Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.
Assume I do not need to edit any documents once they are indexed. This is strictly for searching.
Referencing the image above, there seems to be some documents which point to 'rolling indexes' ref1 ref2 ref3 whereby I may create a single index (ea. index named tweets1 -> N) on-the-fly. When one index fills up, I can simply add a new machine, with a new index, and add it to the same cluster and alias for searching.
Does this architecture hold water in production?
Are there any long term ramifications to this 'rolling index' architecture as opposed to predicting a shard count and scaling within that estimate?

A shard in elasticsearch is just a lucene index. An elasticsearch index is just a collection of lucene indices (shards). Given that, for capacity planning in your situation you simply need to figure out how many documents you can store in an index with only one shard and still get the query performance you want.
It is the underlying lucene indices that use up resources. Based on how your documents are indexed within the lucene indices, there is a finite number of shards that any single node in your cluster will be able to handle. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.
It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
Once you figure out how many documents you can store per shard to get your query performance, then decide how many shards per index you want to have and then multiply those numbers and cap the index at that number of documents in your code. Once you reach the cap you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):
$index = 'file_' . (int)($fid / $docsPerIndex);
Note that I am using index templates so it can automatically create a new index without me having to manually roll over when the cap is reached.
One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.
You need to have enough nodes in your cluster for parallelizing the query that it can easily search across all indices and still respond quickly.
or
You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.
Keep in mind that if you have sequential or predictable ids then elasticsearch can perform id based queries efficiently without actually having to query the whole cluster. If you let ES automatically assign ids (assuming you are using ES >=1.4.0) it will use predictable ids (flake ids) already. This also speeds up indexing. Random ids create a worst case scenario.
If your queries are going to be time based then it will have to search the entire set of indices for each query under this scheme. For time based queries you want to roll your indices over based on some amount of time (e.g. each day or month depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. By doing so you can narrow the set of indices you have to search at query time based on the requested search time range.

Related

Elasticsearch index policy creation best practice/performance

I am designing a search system based on ElasticSearch, after reading a lot I have seen that some systems such as logs use a policy of multiple indexes to save the same content, similar to mylogs-12-02-2020 and are creating an index by day, then to search, they perform the searches in all the indices that comply with the mylogs- * pattern, each of those indices has its primary shards and replicas.
My question would be regarding the performance of the searches, which would be more performant to look at an index of 5 million documents, with n shards or look for 50 indexes of 100,000 documents. Does anyone have any experience with the best practice to follow?
I am assuming that my system will have an approximate growth of 200,000 documents per day.
What is the best practice, separate in multiple indexes or have a single index with several primary shards in different nodes (so that they do not compete for the same resources when searching / indexing)?
When doing a search on mylogs-* elastic does it parallel to the indexes and within each index in its shards?
Elasticsearch default configuration given by #Umar is old and starting with 7.0 ES latest major version, Primary shards reduced to 1, you can check this in ES official breaking changes announcement.
Nobody can design the perfect ES index with optimal no of shards and replicas and required continuous fine-tuning over the period. Some factors which affect the design consideration.
Read or Write-heavy system.
Time-based indices(like your log searches) where normally searches happen on more recent logs or e-commerce product catalog or website search where you can't divide indices into time-based data.
ES cluster(multi-tenant vs dedicated to single index).
Above are just a few samples and I can go can give 100s of other factors, which you can consider while designing your ES index configuration. But the idea is to start with more crucial params first(like changing primary shards requires re-indexing) also consider the near-future growth and fine-tune later on based on current system performance.
I would strongly suggest you go through my detailed blog which would answer your questions about(searching in one index with more docs than searching in more indices/shards with fewer docs) in detail through a real-world case study.
The above blog also explains the ES decision to change the longtime default primary shards from 5 to 1.
Answer to your below question:
Question: When doing a search on mylogs-* elastic does it parallel to the indexes and within each index in its shards?
Answer: Yes, ES has distributed architecture and as ES index is made of Lucene shard which is a full-blown search engine, Every ES query would be executed by multiple threads in parallel if it needs to hit multiple shards(whether of same index or multiple indices), Given threads are free, otherwise once a thread finish, it would be then be used to query another shard. this is why ES is much faster like other distributed systems.
By default, an Elasticsearch index has 5 primary shards and 1 replica for each. But the problem is default configurations are not suitable for every use case.
Shard size is quite critical for search queries. If there would be too many shards that are assigned to an index, Lucene segments would be small which causes an increase in overhead. Lots of small shards would also reduce query throughput when multiple queries are made simultaneously. On the other hand, too large shards cause a decrease in search performance and longer recovery time from failure. Therefore, it is suggested by Elasticsearch that one shard’s size should be around 20 to 40 GB.
Keep in mind it is the shard that acts as a separate search engine in itself, not the index. indices are a type of data organization mechanism, allowing the user to partition data a certain way. that is all!
For further details read this article.

Advice on efficient ElasticSearch document design

I'm working on a project that deals with listings (think: Craiglist, Ebay, Trulia, etc).
The basic unit of information is a "Listing", something like this:
{
"id": 1,
"title": "Awesome apartment!",
"price": 1000000,
// other stuff
}
Some fields can be searched upon (e.g price, location, etc), others are just for display purposes on the application (e.g title, description which contains lots of HTML etc).
My question is: should i store all the data in one document, or split it into two (one for searching e.g 'ListingSearchIndex', one for display, e.g 'ListingIndex').
I also have to do some pretty hefty aggregations across the documents too.
I guess the question is, would searching across smaller documents then doing another call to fetch the results by id be faster than just searching across the full document?
The main factors is obviously speed, but if i split the documents then maintenance would be a factor too.
Any suggestions on best practices?
Thanks :)
In my experience with Elasticsearch, shard configuration has been significant in cluster performance/ speed when querying, aggregating etc. Since, every shard by itself consumes cluster resources (memory/cpu) and has a cost towards cluster overhead it is ideal to get the shard count right so the cluster is not overloaded. Our cluster was over-sharded and it impacted loading search results, visualizations, heavy aggregations etc. Once we fixed our shard count it worked flawlessly!
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
Besides performance, I think there's other aspects to consider here.
ElasticSearch offers weaker guarantees in terms of correctness and robustness than other databases (on this topic see their blog post ElasticSearch as a NoSQL database). Its focus is on search, and search performance.
For those reasons, as they mention in the blog post above:
Elasticsearch is commonly used in addition to another database
One way to go about following that pattern:
Store your data in a primary database (e.g. a relational DB)
Index only what you need for your search and aggregations, and to link search results back to items in your primary DB
Get what you need from the primary DB before displaying - i.e. the data for display should mostly come from the primary DB.
The gist of this approach is to not treat ElasticSearch as a source of truth; and instead have another source of truth that you index data from.
Another advantage of doing things that way is that you can easily reindex from your primary DB when you change your index mapping for a new search use case (or on changing index-time processing like analyzers etc...).
I think you can't answer this question without knowing all your queries in advance. For example consider that you split to documents and later you decide that you need to filter based on a field stored in one index and sort by a field that is stored in another index. This will be a big problem!
So my advice to you, If you are not sure where you are heading, just put everything in one index. You can later reindex and remodel.

how do handle if data in each table increases within same index in elasticsearch

index created for multiple table from a database. Over a period of time record of a couple of table increased and a couple of table has millions of record. but initially it has hundreds of record. So if count of records are increased, how do make better performance.
1) do we need to move the table from old index to newly creating index
or
2) increase the nodes and shards of existing index, thus make better performance.
so i am looking better solution and pls let me know, if my requirement is not clear.
Could anybody answer please.
It sounds like perhaps you should consider using timeseries-based indices, so
you would create an index for every day or month (or whatever time period you
wanted) and then you could use a tool like
curator to manage them. This way you have
more flexibility with what to do with older indices, like closing them, deleting
them, or force-merging them using the optimize API.
If you already have performance issues, it's going to be harder. Since the number of shards is fixed at index creation time, you will have to reindex the data to a new index with more shards.
If you have tables that grow indefinitely (e.g. logs) plan ahead with time-based indexes. If your data is not time-based you can do the same trick. Use templates to automatically create indexes and aliases so you can query them as it would be one index.
There is no golden rule here, but once you know that for example how your index scales for your usecase for 1M records you can do some automatic indexing by id from your primary storage (db). All you have to do is, when indexing pick the right index to write to (since you can't use alias for indexing), querying is transparent through alias. This is a minor change for most apps.

Elasticsearch - implications of splitting documents into separate indexes

Let's say I have 100,000 documents from different customer groups, which are formatted the same with the same type of information.
Documents from individual customer groups get refreshed at different times of the day. I've been recommended to give each customer group their own index so when my individual customer index is refreshed locally I can create a new index for that customer and delete the old index for that customer.
What are the implications for splitting the data into multiple indexes and querying using an alias? Specifically:
Will it increase my server HDD requirements?
Will it increase my server RAM requirements?
Will elasticsearch be slower to search by querying the alias to query all the indexes?
Thank you for any help or advice.
Every index has some overhead on all levels but it's usually small. For 100,000 documents I would question the need for splitting unless these documents are very large. In general each added index will:
Require some amount of RAM for insert buffers and other per-index related tasks
Have it's own merge overhead on disk relative to a larger single index
Provide some latency increase at query time due to result merging if a query spans multiple indexes
There are a lot of factors that go into determining if any of these are significant. If you have lots of RAM and several CPUs and SSDs then you may be fine.
I would advise you to build a solution that uses the minimum number of shards as possible. That probably means one (or at least only a few) index(es).

Should I control the Index size in Elastic Search?

I have a fast growing database and I'm using Elastic Search to manage it.it has only one index and gets 200 K new documents per day. each document contains of about 5 KB text.
Should I keep using only one index or it's better to have one index for each day or something else?
If so, what's the benefits of having multiple indices?
You should definitely worry about the maximum size of your shards/index. We use daily indexes for stuff where we are inserting millions of records per day and monthly indexes where were are inserting millions per month.
A good rule of thumb is that shards should max out around 4 GB (remember there are a configurable number of shards per index).
The advantage is that when you have daily/weekly/monthly indexes, you can eventually close/delete them when your cluster becomes too big or the data isn't useful anymore. If your data is time series data, you can craft your queries to only hit the indexes that are used for the given data. Also if you've made a mistake in how many shards you really need, you can correct it going forward (because you create a new index periodically).
The disadvantage is then that you have to manage all of the extra indexes, but there are tools to do that (elasticsearch-curator for example).

Resources