Migrating very large Elasticsearch indices

We ran out of space due to a very large index (5TB primary | 5TB replica). This index has 5 shards (each shard is 1TB). We are planning to migrate this index to a bigger AWS instance type. Which settings can be modified to make the migration fast and smooth?
Note: We are using the default Elasticsearch settings.

First of all, I'd like to point out that a 1TB shard is way beyond the recommended limit of about 30GB. I'd also assume that, because of this, your cluster probably isn't as well optimised as it could be, even for extreme scenarios.
Secondly, the recommended settings depend on the method you're using to migrate this index.
I'd personally let snapshot/restore take care of the process, as it would use the least bandwidth and hence reduce the transfer time. Once the snapshot is done, since it is already in an AWS region, the restore would be faster.
Again, I'm assuming a lot here, so a lot depends on your limitations and preferred method.
All the best.
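As a rough illustration of the snapshot/restore route, the sketch below only builds the REST calls involved; it does not send them. The repository name `s3_backup`, the snapshot name `migration-1` and the index name `big_index` are hypothetical, and restoring with 0 replicas before adding the replica back is an assumption about how you might shorten the restore, not something the answer prescribes.

```python
import json


def snapshot_restore_plan(repo, snapshot, index, replicas_during_restore=0):
    """Return the (method, path, body) calls for a snapshot-based migration."""
    return [
        # Old cluster: snapshot only the one large index.
        ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=false",
         {"indices": index, "include_global_state": False}),
        # New cluster (same repository registered): restore the index,
        # overriding the replica count so only primaries are copied first.
        ("POST", f"/_snapshot/{repo}/{snapshot}/_restore",
         {"indices": index,
          "include_global_state": False,
          "index_settings": {
              "index.number_of_replicas": replicas_during_restore}}),
        # New cluster: once the primaries are restored, add the replica back.
        ("PUT", f"/{index}/_settings", {"index": {"number_of_replicas": 1}}),
    ]


plan = snapshot_restore_plan("s3_backup", "migration-1", "big_index")
for method, path, body in plan:
    print(method, path, json.dumps(body))
```

Each tuple can be sent with whatever HTTP client you already use; the repository itself has to be registered on both clusters beforehand.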

Related

Is elasticsearch safe in single node in production environment of sensitive information?

Elasticsearch on a single node can be faster than a cluster, but what are the advantages and disadvantages of running it in a production environment with only one node?
I have a problem: my DBA wants to run Elasticsearch on a single node, with speed as the justification, but he is not taking into account availability, redundancy, and recovery from failures or disaster. The system is not slow, and all the documentation I have read says that in a production environment Elasticsearch should run as a cluster with multiple nodes and shards. We are an information bureau; we provide data for banks, credit analysis, and numerous large customer applications. Help me with arguments showing that the information we are dealing with needs high availability and redundancy. The index size is about 2.2TB. I want to run on a cluster because the information is very sensitive, but my DBA wants to run on a single node in production. Help me give him an answer: is he right, or am I?
You should run a benchmark of the potential workloads if he needs proof.
It isn't really viable to run any real production load on a single system because the shards are what really create the speed of elasticsearch: Queries are run on many shards and partial results are returned to form the full result. If you use a single node to scan the entire dataset it will take a very long time for that single node to process everything.
I guess the issue really comes down to how much your data is getting updated and how many queries are running in parallel.
Without any network traffic it could seem faster on a small workload, but if your search cluster needs to do continuous indexing and run parallel queries it will just get stuck and stop returning results.
Whether a single-node cluster is faster or slower than a multi-node cluster depends on the use case and many other factors. The argument that a single node is faster is not valid without a comparative benchmark of the real use case.
For example, if your use case has more indexing than searches, a single-node cluster can be faster; if it has more searches than indexing, a single-node cluster can be slower.
But even if it is faster in a specific use case, running a single-node cluster in production is not recommended and it is very risky.
The main issue is that a single node has no resilience to failure. If your node is down, your entire cluster is down, and until the node is back up and running, the data in your Elasticsearch cluster is unavailable.
Depending on how you will ingest the data, this can also lead to data loss.
If for some reason the data in your node gets corrupted or lost, you will need to restore it from a previous snapshot. On a multi-node cluster if a node is lost, you can spin-up another one and the cluster will take care of replicating the data, assuming you use replicas for your index.
There are also some limits on how many shards you can have on a node and how much memory you should use for the Java heap. Elastic's recommendations are to keep the number of shards per GB of heap below 20 and to not use more than 30GB for the heap, so a single node gives you a maximum of about 600 shards (20 × 30).
Whether this is enough depends on the use case and your indexing/shard strategy.
You should ask yourself whether you can afford downtime and whether you can afford to lose data. If the answer to both is no, then you should not use a single-node cluster.
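The shard ceiling quoted above is simple arithmetic, sketched here as a small helper. The function name and its defaults are illustrative; the 20 shards per GB and 30GB heap figures are the ones cited in the answer.

```python
def max_recommended_shards(heap_gb, shards_per_gb=20, heap_cap_gb=30):
    """Upper bound on shards per node under the rule of thumb above."""
    # Heap beyond the recommended cap does not raise the ceiling.
    usable_heap_gb = min(heap_gb, heap_cap_gb)
    return usable_heap_gb * shards_per_gb


print(max_recommended_shards(30))  # 600
print(max_recommended_shards(64))  # 600, heap is capped at 30GB
print(max_recommended_shards(2))   # 40
```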

Elasticsearch maximum index count limit

Is there any limit on how many indexes we can create in Elasticsearch?
Can 100,000 indexes be created in Elasticsearch?
I have read that a maximum of 600-1000 indices can be created. Can it be scaled?
E.g.: I have a number of stores, and each store has items. Each store will have its own index where its items will be indexed.
There is no limit as such, but obviously you don't want to create too many indices ("too many" depends on your cluster, nodes, size of indices, etc.). In general it's not advisable, as it can have a severe impact on cluster functioning and performance.
Please check Loggly's blog; its first point is about proper provisioning, and the relevant text from that post is quoted below.
ES makes it very easy to create a lot of indices and lots and lots of
shards, but it’s important to understand that each index and shard
comes at a cost. If you have too many indices or shards, the
management load alone can degrade your ES cluster performance,
potentially to the point of making it unusable. We’re focusing on
management load here, but running too many indices/shards can also
have pretty significant impacts on your indexing and search
performance.
The biggest factor we’ve found to impact management overhead is the
size of the Cluster State, which contains all of the mappings for
every index in the cluster. At one point, we had a single cluster with
a Cluster State size of over 900MB! The cluster was alive but not
usable.
Edit: Thanks to @Silas, who pointed out that from ES 2.x, cluster state updates are not as costly (only the diff is sent in the update call). More info on this change can be found on this ES issue

Elasticsearch - general architecture and Elastic Cloud questions

Background
We're designing the architecture of a new system using Elasticsearch now, and we plan to use Elastic Cloud based on reviews contrasting their service with AWS's, and self-hosting on an EC2 instance. As we design the system, I'm trying to learn from a small test project my team deployed on Elastic Cloud 6 months ago. While I've spent a lot of time reading the Elasticsearch Docs, Elasticsearch: The Definitive Guide, and Elastic Cloud's Docs, there are some concepts here that I'm still not understanding.
Our Test Project's issues
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node, currently with 2GB of memory. Because there is only one node, and because replica shards are never assigned to the same node as their primary shard (reason 2), none of the replicas are getting assigned. Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards), and over time, the proverbial Kagillion Shards. This system was only ever meant to have several months of data on it at a time, so the solution has been to manually delete old data when memory on this deployment runs out.
The New System
Our new system is meant to have 5 years' worth of time-based data on it, which is projected to grow to 250GB in size. The current implementation uses a single index for the time-based data, with 6 primary shards and 1 replica per primary. This decision was made based on reading that a single shard should aim for a maximum of 30GB in size.
Questions
Our old system had one node with too many indexes (over 100) and too many shards (over 1000), and it seems like our new one is being designed with too few (one index for 5+ years of data). It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month? That being said, according to another answer on SO the optimal number of indexes per node is 1, so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
How does one add a node to an ES deployment in Elastic Cloud? Currently all of the replica nodes in the test project are unassigned, because the deployment only has one node. There is a slider which allows you to easily choose the memory of each node in a deployment (between 1GB and 250B), however I see no way to add multiple nodes, which is confusing because it seems like basic functionality for Elasticsearch.
Our test project's node has restarted several times, always when there is lots of old data on the node, and therefore memory pressure. The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes. I've asked their support about this, but just curious to see if anyone knows what could cause this and how to resolve it?
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node
Clearly, on a single node, you cannot have replicas. So your index should have been configured with 0 replicas and you can do it dynamically to get your cluster back to green (PUT index/_settings {"index.number_of_replicas": 0}), simple as that.
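For readers who prefer to script that settings call, here is a minimal sketch using only the Python standard library. It builds the request without sending it; the base URL `http://localhost:9200` and the index name `my-index` are placeholders, and authentication (which Elastic Cloud requires) is left out.

```python
import json
import urllib.request


def replicas_request(base_url, index, replicas):
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = replicas_request("http://localhost:9200", "my-index", 0)
print(req.get_method(), req.full_url)
# To actually apply the change: urllib.request.urlopen(req)
```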
Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards)
I cannot tell whether 50 new primary shards (10 indexes) per day is reasonable or not, because you don't give any information regarding the volume of data in your test project. But it's probably too many.
It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month?
Having five years' worth of data in a single index is perfectly possible. It doesn't really depend on how old the data is, but on how big it grows. You mention 250GB, and also that you know a shard shouldn't grow over 30GB (which again depends on the spec of your underlying hardware, more on that later). Since you have only 6 shards for that index, each shard will grow to over 40GB (250GB / 6 is roughly 42GB, which is ok according to this). To be on the safe side, you should probably increase to 8-9 shards, or split your data into yearly/monthly indices.
The 30GB-ish limit per shard is also dependent on how much heap your nodes have. If you have nodes with 2GB heap, then having 30GB shards is clearly too big. Since you're on ES Cloud and you plan to have 250GB of data, you must have chosen a node capacity of 16GB heap + 384GB storage (or bigger). So with 16GB heap, it's reasonable to have 30GB shards, but you'll need several nodes in my opinion. You can verify how many nodes you have using GET _cat/nodes?v.
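The sizing argument above boils down to two divisions. The helper below is only a sketch of that arithmetic; `shard_plan` is an illustrative name, and the 30GB target is the rule of thumb from the discussion, not a hard limit.

```python
import math


def shard_plan(total_gb, primary_shards, target_shard_gb=30):
    """Compare the current shard size with a target size per shard."""
    return {
        # Size each primary shard reaches with the current shard count.
        "gb_per_shard": total_gb / primary_shards,
        # Primary shards needed to stay at or below the target size.
        "shards_for_target": math.ceil(total_gb / target_shard_gb),
    }


sizing = shard_plan(250, 6)
print(round(sizing["gb_per_shard"], 1))  # 41.7
print(sizing["shards_for_target"])       # 9
```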
That being said, according to another answer on SO the optimal number of indexes per node is 1...
What Chris is saying is a theoretical/ideal setting, which is almost never possible/advisable/desired to do in reality. You do want to have several shards in your index and the reason is that when your data grows, you want to be able to scale to more than one node, that's the whole point of ES, otherwise you'd be better off embedding the Lucene library directly in your project.
..., so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
First check how many nodes you have in your cluster using GET _cat/nodes?v, but clearly if you're assigned a single node for 250GB of data split on 6-8 shards, a single node is not ideal, indeed.
How does one add a node to an ES deployment in Elastic Cloud?
Right now, you can't. However, at the last Elastic{ON} conference, Elastic announced that it will be possible to pick the number of nodes or the kind of deployment (hot/warm, etc) you want to set up.
Currently all of the replica nodes in the test project are unassigned, because the deployment only has one node.
You don't really need replicas in a test project, right?
The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
How did you delete the data? Between the time you deleted the data and before the node restarted, did you witness that the data was indeed gone?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes.
This is weird, since on ES cloud your cluster generally gets snapshotted every 30 minutes. What do you see under Deployments > cluster-id > Elasticsearch > Snapshots? What does the ES Cloud support say about it? What do you get when running GET _cat/repositories?v and GET _cat/snapshots/found-snapshots?v? (update your question with the results)

How to setup ElasticSearch cluster with auto-scaling on Amazon EC2?

There is a great tutorial elasticsearch on ec2 about configuring ES on Amazon EC2. I studied it and applied all recommendations.
Now I have AMI and can run any number of nodes in the cluster from this AMI. Auto-discovery is configured and the nodes join the cluster as they really should.
The question is: how do I configure the cluster so that I can automatically launch/terminate nodes depending on cluster load?
For example I want to have only 1 node running when we don't have any load and 12 nodes running on peak load. But wait, if I terminate 11 nodes in cluster what would happen with shards and replicas? How to make sure I don't lose any data in cluster if I terminate 11 nodes out of 12 nodes?
I might want to configure S3 Gateway for this. But all the gateways except for local are deprecated.
There is an article in the manual about shard allocation. Maybe I'm missing something very basic, but I should admit I failed to figure out whether it is possible to configure one node to always hold copies of all the shards. My goal is to make sure that if this were the only node running in the cluster, we still wouldn't lose any data.
The only solution I can imagine now is to configure the index to have 12 shards and 12 replicas. Then, when up to 12 nodes are launched, every node would have a copy of every shard. But I don't like this solution because I would have to reconfigure the cluster if I wanted more than 12 nodes at peak load.
Auto scaling doesn't make a lot of sense with ElasticSearch.
Shard moving and re-allocation is not a light process, especially if you have a lot of data. It stresses IO and network, and can badly degrade the performance of ElasticSearch. (If you want to limit the effect, you should throttle cluster recovery using settings like cluster.routing.allocation.cluster_concurrent_rebalance, indices.recovery.concurrent_streams and indices.recovery.max_size_per_sec. This will limit the impact but will also slow down re-balancing and recovery.)
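A throttling request of the kind described above might look like the following sketch, which only builds the body for `PUT /_cluster/settings`. Setting names have changed across Elasticsearch versions: the recovery rate limit is called `indices.recovery.max_bytes_per_sec` in later releases, which is what is assumed here, so check the names against the version you run. The values `40mb` and `1` are placeholders, not recommendations.

```python
import json


def throttle_settings(max_bytes_per_sec="40mb", concurrent_rebalance=1,
                      scope="transient"):
    """Build a cluster-settings body that slows rebalancing and recovery."""
    return {
        scope: {
            # How many shard rebalances may run at once across the cluster.
            "cluster.routing.allocation.cluster_concurrent_rebalance":
                concurrent_rebalance,
            # Per-node recovery bandwidth cap (name assumed, see lead-in).
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
        }
    }


print("PUT /_cluster/settings")
print(json.dumps(throttle_settings(), indent=2))
```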
Also, if you care about your data you don't want to have only 1 node ever. You need your data to be replicated, so you will need at least 2 nodes (or more if you feel safer with a higher replication level).
Another thing to remember is that while you can change the number of replicas, you can't change the number of shards. This is configured when you create your index and cannot be changed (if you want more shards you need to create another index and reindex all your data). So your number of shards should take into account the data size and the cluster size, considering the highest number of nodes you want but also your minimal setup (can fewer nodes hold all the shards and serve the estimated traffic?).
So theoretically, if you want to have 2 nodes at low time and 12 nodes on peak, you can set your index to have 6 shards with 1 replica. So on low times you have 2 nodes that hold 6 shards each, and on peak you have 12 nodes that hold 1 shard each.
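The 2-node and 12-node figures can be checked with a few lines of arithmetic. This is only a sketch: `copies_per_node` is an illustrative name, and it assumes the cluster spreads shard copies evenly, which Elasticsearch aims for but does not guarantee at every moment.

```python
def copies_per_node(primary_shards, replicas, nodes):
    """Return (total shard copies, copies per node, leftover copies)."""
    if nodes < replicas + 1:
        # A replica is never placed on the same node as its primary,
        # so some copies would stay unassigned.
        raise ValueError("need at least replicas + 1 nodes")
    total = primary_shards * (1 + replicas)
    per_node, leftover = divmod(total, nodes)
    return total, per_node, leftover


print(copies_per_node(6, 1, 2))   # (12, 6, 0)
print(copies_per_node(6, 1, 12))  # (12, 1, 0)
```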
But again, I strongly suggest rethinking this and testing the impact of shard moving on your cluster performance.
In cases where the elasticity of your application is driven by a variable query load, you could set up ES nodes configured to not store any data (node.data = false, http.enabled = true) and then put them in for auto scaling. These nodes could offload all the HTTP and result conflation processing from your main data nodes (freeing them up for more indexing and searching).
Since these nodes wouldn't have shards allocated to them bringing them up and down dynamically shouldn't be a problem and the auto-discovery should allow them to join the cluster.
I think this is a general concern when employing an auto-scalable architecture to meet temporary demand while data still needs to be saved. I think there is a solution that leverages EBS:
Map shards to specific EBS volumes. Let's say we need 15 shards. We will need 15 EBS volumes.
Amazon allows you to mount multiple volumes, so we can start with a few instances that have multiple volumes attached to them.
As load increases, we can spin up additional instances, up to 15.
The above solution is only advised if you know your maximum capacity requirements.
I can give you an alternative approach using the AWS Elasticsearch Service (it will cost a little more than running Elasticsearch on plain EC2). Write a simple script that continuously monitors the load on the service (through the API/CLI) and, if the load goes beyond a threshold, programmatically increases the number of nodes in your AWS Elasticsearch Service cluster. The advantage here is that AWS takes care of the scaling (per the documentation, they take a snapshot and launch a completely new cluster). This works for scaling down as well.
The auto-scaling approach has some challenges: shard movement has an impact on the existing cluster, and we need to be more vigilant while scaling down. You can find a good article on scaling down here, which I have tested. If you can automate the steps in the above link through scripting (Python, shell) or automation tools like Ansible, then scaling in/out is achievable. But again, you need to start scaling up well before the normal limits, since scale-up activity can have an impact on the existing cluster.
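The decision step of such a monitoring script could be as small as the sketch below. Everything in it is hypothetical: the thresholds, the one-node-at-a-time step, and the node bounds (2 and 12, taken from the numbers discussed in this thread). Fetching the load metric and applying the new node count through the AWS API are deliberately left out.

```python
def desired_node_count(current, load_pct, min_nodes=2, max_nodes=12,
                       scale_up_at=60.0, scale_down_at=25.0):
    """Pick the next node count from the current count and a load metric."""
    if load_pct >= scale_up_at:
        # Scale up early, since adding nodes itself loads the cluster.
        target = current + 1
    elif load_pct <= scale_down_at:
        # Scale down one node at a time so shards can relocate first.
        target = current - 1
    else:
        target = current
    # Never leave the configured bounds.
    return max(min_nodes, min(max_nodes, target))


print(desired_node_count(2, 80.0))   # 3
print(desired_node_count(12, 90.0))  # 12
print(desired_node_count(2, 10.0))   # 2
```

Keeping a wide gap between the two thresholds avoids flapping between node counts when the load hovers around a single cut-off.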
Question: is it possible to configure one node to always hold copies of all the shards?
Answer: Yes, it's possible with explicit shard routing. More details here
I would be tempted to suggest solving this a different way in AWS. I don't know what ES data this is or how it's updated, etc. Making a lot of assumptions, I would put the ES instance behind an ALB (Application Load Balancer). I would have a scheduled process that creates updated AMIs regularly (if you do it often, it will be quick to do); then, based on the load of your single server, I would trigger more instances to be created from the latest image you have available. Add the new instances to the ALB to share some of the load. As things quiet down, I would trigger the termination of the temporary instances. If you go this route, here are a couple more things to consider:
Use spot instances, since they are cheaper, if that fits your use case.
The "T" instances don't fit well here, since they need time to build up credits.
Use Lambdas for the task of turning things on and off; if you want to be fancy, you can trigger them based on a webhook to the AWS gateway.
Making more assumptions about your use case, consider putting a Varnish server in front of your ES machine so that you can provide scale more cheaply with a cache strategy (lots of assumptions here); based on the stress, you can dial in the right TTL for cache eviction. Check out the soft-purge feature; for our ES stuff we have gotten a lot of good value from it.
If you do any of what I suggest here, make sure your spawned ES instances report their logs back to a central, addressable place on the persistent ES machine so you don't lose logs when the machines die.

MongoDB capacity planning

I have an Oracle database with around 7 million records/day (~300GB) and I want to switch to MongoDB.
To set up a POC, I'd like to know how many nodes I need. I think 2 shards, each a replica set of 3 nodes, will be enough, but I want to know your thinking about it :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size. You need to know how much data 7 million records/day amounts to. This is the active data that will need to stay in RAM for high performance.
Also, be very sure WHY you are migrating to Mongo. I'm guessing that in your case it is scalability,
but know your data well before doing so.
For your POC, keeping two shards means roughly 150GB on each. If you have that much disk available, no problem.
Give some consideration to your shard keys: which fields does it make sense to shard your data set on? This will affect the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few shards, maybe two or three big, deep shards, if your data can be easily segmented into halves or thirds, or several lighter, thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (eg. will your application requirements outgrow the resources of a single machine; how much of your data set will be active working set for queries, etc).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage. For example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
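To put numbers on the POC layouts discussed above, here is a small sketch of the arithmetic. It counts only data-bearing replica set members; a sharded cluster also needs config servers and mongos routers, which are not included. The function name is illustrative, and the 300GB and three-member figures come from the thread.

```python
def mongo_poc_layout(total_gb, shards, members_per_replica_set=3):
    """Estimate data per shard and replica set member count for a POC."""
    return {
        # Assumes the shard key distributes data evenly across shards.
        "gb_per_shard": total_gb / shards,
        # Each shard is a replica set; count its members.
        "replica_set_members": shards * members_per_replica_set,
    }


print(mongo_poc_layout(300, 2))  # {'gb_per_shard': 150.0, 'replica_set_members': 6}
print(mongo_poc_layout(300, 1))  # {'gb_per_shard': 300.0, 'replica_set_members': 3}
```

The second call corresponds to starting with a single replica set and sharding later, as suggested in the last answer.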

Resources