How to handle data growth for multiple tables within the same index in Elasticsearch

An index was created for multiple tables from a database. Over time, the record counts of a couple of tables have grown, and a couple of them now have millions of records, though initially they had only hundreds. Given this growth, how do I get better performance?
1) Do we need to move the tables from the old index to a newly created index,
or
2) increase the nodes and shards of the existing index to get better performance?
I am looking for the better solution; please let me know if my requirement is not clear.

It sounds like perhaps you should consider using time-series indices: you would create an index for every day or month (or whatever time period you want) and then use a tool like Curator to manage them. This way you have more flexibility with what to do with older indices, like closing them, deleting them, or force-merging them (via the force merge API, formerly called optimize).
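A minimal sketch of the naming half of this pattern (the `logs` prefix and the Logstash-style `YYYY.MM.DD` format are just illustrative choices):

```python
from datetime import datetime, timezone

def daily_index_name(prefix: str, ts: datetime) -> str:
    """Build a daily index name such as 'logs-2024.01.15'."""
    return f"{prefix}-{ts.strftime('%Y.%m.%d')}"

# All writes for a given day land in that day's index; older indices
# can later be closed, deleted, or force-merged as whole units.
name = daily_index_name("logs", datetime(2024, 1, 15, tzinfo=timezone.utc))
# name == "logs-2024.01.15"
```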

If you already have performance issues, it's going to be harder. Since the number of shards is fixed at index creation time, you will have to reindex the data into a new index with more shards.
If you have tables that grow indefinitely (e.g. logs), plan ahead with time-based indices. If your data is not time-based you can use the same trick. Use index templates to create indices automatically, and aliases so you can query them as if they were one index.
There is no golden rule here, but once you know how your index scales for your use case (say, for 1M records), you can route documents to indices automatically by id from your primary storage (the database). All you have to do is pick the right index when writing (since an alias spanning multiple indices can't be used for indexing); querying stays transparent through the alias. This is a minor change for most apps.
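As a sketch of that id-based routing (the `orders` prefix and the one-million-documents-per-index figure are assumptions you would replace with numbers from your own benchmarks):

```python
# Assumed capacity per index, established by benchmarking your use case.
DOCS_PER_INDEX = 1_000_000

def write_index_for(doc_id: int, prefix: str = "orders") -> str:
    """Pick the concrete index to write a document to.

    Reads stay transparent: they go through an alias spanning all
    'orders_*' indices, so only the write path needs this logic.
    """
    return f"{prefix}_{doc_id // DOCS_PER_INDEX}"

# doc 42 goes to "orders_0"; doc 2,500,000 goes to "orders_2"
```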

Related

Do I need to split order data into multiple time based index in Elasticsearch?

I am planning to use Elasticsearch to store user order data. There could be 20 million orders per year in my system, and 20 million orders probably take about 10 GB.
My question is whether I should create one index to hold all order data. I have read ES docs saying it's better to keep around 20 GB of data per primary shard. If I create one index with 5 primary shards, does that mean I can safely store 100 GB (200 million orders) in this index?
Another approach is to create an index per year, e.g. order-2020, order-2021, order-2022, and create fewer primary shards for each index. I understand this pattern may help if I want to apply a retention period to my order data, but apart from that, what other benefits does this pattern offer?
From query performance perspective, which approach is better?
In terms of search speed and aggregation accuracy, splitting across multiple indices and shards inevitably costs something, but for the health of your data it is recommended to split it by year. You can use an alias to tie the yearly indices together; the loss in query performance is much smaller than the loss in aggregation.

In Elasticsearch, should I create a single index with multiple types, or multiple indices with single types?

I am new to Elasticsearch and am using it for big data.
There are no join queries in my application, so which structure is best for my application?
I have been working with Elasticsearch for the past few days and would like to share my experience/learnings.
1) If you are moving from a relational DB like MySQL or SQL Server to ES, you need to maintain the relations among your data. Declare the primary key in the different types or indices; on that basis you can build your Query DSL.
2) If you are dealing with millions of records every day, you need to design accordingly. Some people prefer a duration-based structure (day-, week-, or month-wise); it totally depends on your use case. For a large data set (~1 TB) you need to distribute your data across multiple indices and shards.
3) If you have a small data set, it will work with the default settings too (5 shards, 1 replica). Performance will be better if the data set in each shard stays small.
4) JOIN queries can be expensive in Elasticsearch, and performing them frequently can put pressure on your heap. So I would suggest preparing your data set with pre-cooked data (the result you would get from a join query in a relational DB) and giving each document a unique ID. Check here to see how to perform a JOIN.
5) There might be some points which you need to take care while designing your index:
Don't treat Elasticsearch like a database
Know your use case BEFORE you jump in
Organize your data wisely
Make smart use of replicas
Base your capacity plans on experiment
6) A wrong architecture can force a reindex, which is costly and involves downtime. Check out this article to learn about index design and best practices.
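A sketch of point 4, pre-cooking a relational join into flat documents before indexing (the `users`/`orders` tables and all field names here are invented for illustration):

```python
users = {1: {"name": "alice"}, 2: {"name": "bob"}}
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 2, "total": 40.0},
]

def denormalize(orders, users):
    """Embed the joined user fields into each order document and give
    every document a unique id, so no join is needed at query time."""
    docs = []
    for order in orders:
        doc = dict(order)
        doc["user_name"] = users[order["user_id"]]["name"]
        doc["_id"] = f"order-{order['order_id']}"
        docs.append(doc)
    return docs

flat = denormalize(orders, users)
# flat[0]["user_name"] == "alice", flat[0]["_id"] == "order-10"
```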

Elasticsearch TTL vs dropping daily indices

I understand that there are two dominant patterns for keeping a rolling window of data inside elasticsearch:
creating daily indices, as suggested by logstash, and dropping old indices, and therefore all the records they contain, when they fall out of the window
using elasticsearch's TTL feature and a single index, having elasticsearch automatically remove old records individually as they fall out of the window
Instinctively I go with 2, as:
I don't have to write a cron job
a single big index is easier to communicate to my colleagues and for them to query (I think?)
any nightmare stream dynamics, that cause old log events to show up, don't lead to the creation of new indices and the old events only hang around for the 60s period that elasticsearch uses to do ttl cleanup.
But my gut tells me that dropping an index at a time is probably a lot less computationally intensive, though tbh I've no idea how much less intensive, nor how costly the ttl is.
For context, my inbound streams will rarely peak above 4K messages per second (mps) and are much more likely to hang around 1-2K mps.
Does anyone have any experience with comparing these two approaches? As you can probably tell I'm new to this world! Would appreciate any help, including even help with what the correct approach is to thinking about this sort of thing.
Cheers!
Short answer is, go with option 1 and simply delete indexes that are no longer needed.
Long answer is it somewhat depends on the volume of documents that you're adding to the index and your sharding and replication settings. If your index throughput is fairly low, TTLs can be performant, but as you start to write more docs to Elasticsearch (or if you have a high replication factor) you'll run into two issues.
Deleting documents with a TTL requires that Elasticsearch runs a periodic service (IndicesTTLService) to find documents that are expired across all shards and issue deletes for all those docs. Searching a large index can be a pretty taxing operation (especially if you're heavily sharded), but worse are the deletes.
Deletes are not performed instantly within Elasticsearch (Lucene, really); instead documents are "marked for deletion". A segment merge is required to expunge the deleted documents and reclaim disk space. If you have a large number of deletes in the index, it puts much more pressure on your segment merge operations, to the point where it will severely affect other thread pools.
We originally went the TTL route and had an ES cluster that was completely unusable and began rejecting search and indexing requests due to greedy merge threads.
You can experiment with "what document throughput is too much?" but judging from your use case, I'd recommend saving some time and just going with the index deletion route which is much more performant.
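The index-deletion route amounts to a small retention job. A sketch (the daily `logs-YYYY.MM.DD` naming and the 7-day window are assumptions):

```python
from datetime import date, timedelta

def expired_indices(index_names, prefix, today, retention_days):
    """Return daily indices older than the retention window; dropping a
    whole index is far cheaper than TTL-style per-document deletes."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        day_str = name.removeprefix(prefix + "-").replace(".", "-")
        if date.fromisoformat(day_str) < cutoff:
            expired.append(name)
    return expired

names = ["logs-2024.01.01", "logs-2024.01.10", "logs-2024.01.11"]
old = expired_indices(names, "logs", date(2024, 1, 12), retention_days=7)
# old == ["logs-2024.01.01"]
```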
I would go with option 1 - i.e. daily dropping of indices.
Daily Dropping Indices
pros:
This is the most efficient way of deleting data
If you need to restructure your index (e.g. apply a new mapping, increase number of shards) any changes are easily applied to the new index
Details of the current index (i.e. the name) are hidden from clients by using aliases
Time based searches can be directed to search only a specific small index
Index templates simplify the process of creating the daily index.
These benefits are also detailed in the Time-Based Data Guide, see also Retiring Data
cons:
Needs more work to set up (e.g. set up of cron jobs), but there is a plugin (curator) that can help with this.
If you perform updates on data then all versions of a document will need to sit in the same index, i.e. multiple indexes won't work for you.
Use of TTL or Queries to delete data
pros:
Simple to understand and easily implemented
cons:
When you delete a document, it is only marked as deleted. It won’t be physically deleted until the segment containing it is merged away. This is very inefficient as the deleted data will consume disk space, CPU and memory.

ElasticSearch Scale Forever

ElasticSearch Community:
Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.
Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.
Assume I do not need to edit any documents once they are indexed. This is strictly for searching.
Referencing the image above, there seems to be some documents which point to 'rolling indexes' ref1 ref2 ref3 whereby I may create a single index (ea. index named tweets1 -> N) on-the-fly. When one index fills up, I can simply add a new machine, with a new index, and add it to the same cluster and alias for searching.
Does this architecture hold water in production?
Are there any long term ramifications to this 'rolling index' architecture as opposed to predicting a shard count and scaling within that estimate?
A shard in elasticsearch is just a lucene index. An elasticsearch index is just a collection of lucene indices (shards). Given that, for capacity planning in your situation you simply need to figure out how many documents you can store in an index with only one shard and still get the query performance you want.
It is the underlying lucene indices that use up resources. Based on how your documents are indexed within the lucene indices, there is a finite number of shards that any single node in your cluster will be able to handle. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.
It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
Once you figure out how many documents you can store per shard while keeping the query performance you want, decide how many shards per index you want, multiply those numbers, and cap the index at that number of documents in your code. Once you reach the cap you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):
$index = 'file_' . (int)($fid / $docsPerIndex); // e.g. $fid = 2500000, $docsPerIndex = 1000000 yields "file_2"
Note that I am using index templates so it can automatically create a new index without me having to manually roll over when the cap is reached.
One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.
You need to have enough nodes in your cluster for parallelizing the query that it can easily search across all indices and still respond quickly.
or
You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.
Keep in mind that if you have sequential or predictable ids then elasticsearch can perform id based queries efficiently without actually having to query the whole cluster. If you let ES automatically assign ids (assuming you are using ES >=1.4.0) it will use predictable ids (flake ids) already. This also speeds up indexing. Random ids create a worst case scenario.
If your queries are going to be time based then it will have to search the entire set of indices for each query under this scheme. For time based queries you want to roll your indices over based on some amount of time (e.g. each day or month depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. By doing so you can narrow the set of indices you have to search at query time based on the requested search time range.
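A sketch of the index-selection step for such time-bounded queries (the `tweets` prefix and `YYYY_MM` suffix follow the naming above; the rest is assumed):

```python
from datetime import date

def monthly_indices(prefix: str, start: date, end: date) -> list:
    """List the monthly indices covering [start, end], so a time-bounded
    search only has to hit that subset instead of the whole cluster."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

# monthly_indices("tweets", date(2015, 1, 5), date(2015, 3, 2))
# == ["tweets_2015_01", "tweets_2015_02", "tweets_2015_03"]
```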

Should I control the Index size in Elastic Search?

I have a fast-growing database and I'm using Elasticsearch to manage it. It has only one index and gets 200K new documents per day; each document contains about 5 KB of text.
Should I keep using only one index, or is it better to have one index per day, or something else?
If so, what are the benefits of having multiple indices?
You should definitely worry about the maximum size of your shards/index. We use daily indexes for cases where we are inserting millions of records per day, and monthly indexes where we are inserting millions per month.
A good rule of thumb is that shards should max out around 4 GB (remember there are a configurable number of shards per index).
The advantage is that when you have daily/weekly/monthly indexes, you can eventually close/delete them when your cluster becomes too big or the data isn't useful anymore. If your data is time series data, you can craft your queries to only hit the indexes that are used for the given data. Also if you've made a mistake in how many shards you really need, you can correct it going forward (because you create a new index periodically).
The disadvantage is then that you have to manage all of the extra indexes, but there are tools to do that (elasticsearch-curator for example).
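Putting rough numbers on the sizing advice above (the 4 GB/shard rule of thumb comes from this answer; the 30 GB monthly-index figure applies the question's 200K docs/day × 5 KB ≈ 1 GB/day):

```python
import math

def primary_shards_needed(index_size_gb: float, max_shard_gb: float = 4.0) -> int:
    """Rough primary-shard count for an index under the ~4 GB/shard
    rule of thumb; tune the limit for your own hardware and queries."""
    return max(1, math.ceil(index_size_gb / max_shard_gb))

# A monthly index at ~1 GB/day holds ~30 GB:
shards = primary_shards_needed(30)
# shards == 8
```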
