About creating replicas of Couchbase buckets - view

Will creating a replica of a Couchbase bucket affect the use of its views? How long will it take to create a replica of 1.1 billion documents?
It took 24 hours to build a view on a bucket with 1.1 billion documents. I'm afraid that creating a replica will cause the view to be rebuilt.

Related

Elasticsearch Shard distribution size differs enormously

I need to load 1.2 billion documents into Elasticsearch. As of today we have 6 nodes in the cluster. To distribute the shards equally among the 6 nodes, I set the number of shards to 42. I use Spark, and it takes me almost 3 days to load the index. The shard distribution looks way off.
Node 6 has only two shards while node 2 has almost 10. The size distribution is also uneven: some shards are 114.6 GB while others are just 870 MB within the same node.
I have tried to find a solution myself. I can include
index.routing.allocation.total_shards_per_node: 7
when creating the index to make the shards distribute evenly. Will forcing a fixed number of shards onto a node crash the node if it does not have enough resources available?
I also want to size the shards evenly. My index is approximately 900 GB, and I want each shard to be at least 20 GB. Could I use the following setting when creating the index?
max_primary_shard_size: 25gb
Is setting a maximum shard size only possible through an ILM policy, and would I need a rollover policy for that? I am not too familiar with ILM. Sorry if this does not make sense.
The main reason I am trying to optimize the index is that I am getting timeout errors in my application when querying Elasticsearch. I know I can increase the timeout in my application and do some query optimization, but first I want to optimize my index and make my application as fast as possible.
I load the index only once and do not write any documents to it afterwards. Additional data, which I load every 15 days, goes into a separate index, and I query both indices through an alias. If there are any suggestions for optimizing my indices other than sharding, I would really appreciate them. It takes me 3 days just to load the data, so it is quite difficult to experiment.
Are you using custom routing values in your indexing approach? That might explain the shard size differences.
If you aren't already, disable replicas and refreshes while doing your bulk indexing, as that will speed things up.
Finally, your shard size of 20 GB is probably a little low. I would suggest doubling that size, aiming for less than 50 GB per shard.
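To make the last two suggestions concrete, here is a minimal Python sketch. It only builds the settings bodies you would send to the index settings endpoint before and after the bulk load, and works out a primary shard count for a target shard size. The function names are made up for illustration, and the restored values (1 replica, 1s refresh) are assumptions that should be replaced with whatever the index normally uses.

```python
import math


def bulk_load_settings():
    """Settings to apply before a one-off bulk load: no replicas, no periodic refresh."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def post_load_settings(replicas=1, refresh_interval="1s"):
    """Settings to restore once the load is done (values here are assumptions)."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": refresh_interval}}


def primary_shards_for(index_size_gb, target_shard_gb):
    """Smallest number of primary shards that keeps each shard at or under the target size."""
    return math.ceil(index_size_gb / target_shard_gb)


print(bulk_load_settings())
print(post_load_settings())
# Roughly 900 GB of data with shards of about 40 GB each
print(primary_shards_for(900, 40))
```

Under these assumptions a 900 GB index with 40 GB shards needs 23 primaries rather than 42. This only holds if documents spread evenly, which is not the case when custom routing concentrates them on a few shards.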

Elasticsearch Segments

I have an index with 12 shards allocated across 3 nodes. The index has around 5 million elements. Yesterday these 5 million elements took up around 35 GB; if I could reindex all of them at their new, reduced size, the total size of the index would be around 10 GB.
The problem is that if I reindex these items over the existing ones using upserts, thereby updating them all, the index size does not shrink at all.
I've been looking at the Elasticsearch _forcemerge API, but at the moment I'm not using any replicas, and they don't really recommend using this API on a master set of shards.
I'm still learning the internals of Elasticsearch, and I'm working with a PROD environment, so I can't really clean and reindex.
Can anyone shed some light on this?

Elasticsearch - general architecture and Elastic Cloud questions

Background
We're designing the architecture of a new system using Elasticsearch now, and we plan to use Elastic Cloud based on reviews contrasting their service with AWS's, and self-hosting on an EC2 instance. As we design the system, I'm trying to learn from a small test project my team deployed on Elastic Cloud 6 months ago. While I've spent a lot of time reading the Elasticsearch Docs, Elasticsearch: The Definitive Guide, and Elastic Cloud's Docs, there are some concepts here that I'm still not understanding.
Our Test Project's issues
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node, currently with 2GB of memory. Because there is only one node, and because replica shards are never assigned to the same node as their primary shard (reason 2), none of the replicas are getting assigned. Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards), and over time, the proverbial Kagillion Shards. This system was only ever meant to have several months of data on it at a time, so the solution has been to manually delete old data when memory on this deployment runs out.
The New System
Our new system is meant to have 5 years' worth of time-based data on it, which is projected to grow to 250 GB in size. The current implementation uses a single index for the time-based data, with 6 primary shards and 1 replica per primary. This decision was made based on reading that a single shard should aim for a maximum of 30GB in size.
Questions
Our old system had one node with too many indexes (over 100) and too many shards (over 1000), and it seems like our new one is being designed with too few (one index for 5+ years of data). It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month? That being said, according to another answer on SO the optimal number of indexes per node is 1, so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
How does one add a node to an ES deployment in Elastic Cloud? Currently all of the replica nodes in the test project are unassigned, because the deployment only has one node. There is a slider which allows you to easily choose the memory of each node in a deployment (between 1GB and 250B), however I see no way to add multiple nodes, which is confusing because it seems like basic functionality for Elasticsearch.
Our test project's node has restarted several times, always when there is lots of old data on the node, and therefore memory pressure. The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes. I've asked their support about this, but just curious to see if anyone knows what could cause this and how to resolve it?
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node
Clearly, on a single node, you cannot have replicas. So your index should have been configured with 0 replicas and you can do it dynamically to get your cluster back to green (PUT index/_settings {"index.number_of_replicas": 0}), simple as that.
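As a rough sketch of the same call from Python, the snippet below builds (but does not send) the PUT request using only the standard library. The base URL and index name are placeholders, and a real Elastic Cloud deployment would also need authentication, which is left out here.

```python
import json
import urllib.request


def replicas_request(base_url, index, replicas=0):
    """Build a PUT <index>/_settings request that sets the replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder endpoint and index name, not taken from the question
req = replicas_request("http://localhost:9200", "my-index")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```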
Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards)
I cannot tell whether 50 new primary shards (10 indices) per day is reasonable or not, because you don't give any information regarding the volume of data in your test project. But it's probably too many.
It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month?
Having five years' worth of data in a single index is perfectly possible. It doesn't really depend on how old the data is, but on how big it grows. You mention 250GB and also that you know a shard shouldn't grow over 30GB (and that again depends on the spec of your hardware underneath, more on that later). Since you have only 6 shards for that index, each shard will grow to over 40GB (250GB / 6 is roughly 42GB), which is ok according to this. To be on the safe side, though, you should probably increase to 8-9 shards, or split your data into yearly/monthly indices.
The 30GB-ish limit per shard is also dependent on how much heap your nodes have. If you have nodes with 2GB heap, then having 30GB shards is clearly too big. Since you're on ES Cloud and you plan to have 250GB of data, you must have chosen a node capacity of 16GB heap + 384GB storage (or bigger). So with 16GB heap, it's reasonable to have 30GB shards, but you'll need several nodes in my opinion. You can verify how many nodes you have using GET _cat/nodes?v.
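The shard arithmetic above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch that assumes data is spread evenly across primaries. The helper names are invented for the example.

```python
import math


def shard_size_gb(index_size_gb, primary_shards):
    """Average primary shard size, assuming documents are spread evenly."""
    return index_size_gb / primary_shards


def min_primary_shards(index_size_gb, max_shard_gb):
    """Fewest primaries that keep the average shard at or below max_shard_gb."""
    return math.ceil(index_size_gb / max_shard_gb)


for shards in (6, 8, 9):
    print(f"{shards} shards -> {shard_size_gb(250, shards):.1f} GB per shard")

# Fewest primaries needed to stay at or under 30 GB per shard for 250 GB
print(min_primary_shards(250, 30))
```

With 250GB, 6 primaries give shards of about 42GB, and it takes 9 primaries to stay at or under 30GB each; 8 would land slightly above that.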
That being said, according to another answer on SO the optimal number of indexes per node is 1...
What Chris is saying is a theoretical/ideal setting, which is almost never possible/advisable/desired to do in reality. You do want to have several shards in your index and the reason is that when your data grows, you want to be able to scale to more than one node, that's the whole point of ES, otherwise you'd be better off embedding the Lucene library directly in your project.
..., so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
First check how many nodes you have in your cluster using GET _cat/nodes?v, but clearly if you're assigned a single node for 250GB of data split on 6-8 shards, a single node is not ideal, indeed.
How does one add a node to an ES deployment in Elastic Cloud?
Right now, you can't. However, at the last Elastic{ON} conference, Elastic announced that it will be possible to pick the number of nodes or the kind of deployment (hot/warm, etc) you want to set up.
Currently all of the replica nodes in the test project are unassigned, because the deployment only has one node.
You don't really need replicas in a test project, right?
The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
How did you delete the data? Between the time you deleted the data and before the node restarted, did you witness that the data was indeed gone?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes.
This is weird, since on ES cloud your cluster generally gets snapshotted every 30 minutes. What do you see under Deployments > cluster-id > Elasticsearch > Snapshots? What does the ES Cloud support say about it? What do you get when running GET _cat/repositories?v and GET _cat/snapshots/found-snapshots?v? (update your question with the results)

What is the best way to store big data and create instant search with ES?

I am working on a project that will store millions of records per day. I want to store them in Elasticsearch in a compressed structure (only the searchable fields, with unwanted fields removed) for instant text search. But I also want the uncompressed data to be stored for later processing and analytics. That storage should offer high write speed and be cheap enough to hold billions of records.
Elasticsearch allows you to decide, per index, where to store it (via shard allocation) and what kind of compression you would like to use (via index codec).
So with unlimited resources and time, you could design a process where you index documents into daily indices, for example on a 5-node cluster where the last 7 days are kept on 3 of the servers (let's call these the fast servers) and anything older is kept on the 2 slower servers. That way, queries over the last 7 days run faster, while jobs that are not time-sensitive can run on the older indices stored on the slower servers.
The fast servers could have more computing power and faster SSD disks while the slower servers will have normal spinning disks.
Regarding compression, Elasticsearch compression works on the _source data, so it should not affect aggregation speed. It's also important to note that if you change the index compression, it will only apply to new/updated documents and will not be applied retroactively to documents you've indexed in the past.
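A minimal Python sketch of what those per-index settings might look like follows. It assumes the nodes were started with a custom attribute, here called box_type with the values "fast" and "slow"; both the attribute name and its values are made up for illustration. The setting keys themselves (index.routing.allocation.require.* and index.codec) are standard Elasticsearch index settings.

```python
def daily_index_body(tier, best_compression=False, attr="box_type"):
    """Create-index body that pins a daily index to nodes tagged with the given tier."""
    settings = {f"index.routing.allocation.require.{attr}": tier}
    if best_compression:
        # index.codec is a static setting, so it is chosen when the index is created
        settings["index.codec"] = "best_compression"
    return {"settings": settings}


def move_to_tier(tier, attr="box_type"):
    """Settings update that relocates an existing index to another tier."""
    return {f"index.routing.allocation.require.{attr}": tier}


# New daily index on the fast servers
print(daily_index_body("fast"))
# After 7 days, send this to the index's _settings endpoint to move it to the slow servers
print(move_to_tier("slow"))
```

Changing the allocation setting on an existing index makes Elasticsearch relocate its shards to matching nodes, so the 7-day cutoff can be handled by a scheduled job rather than by reindexing.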

elasticsearch: is creating one index for each log good?

I am using Elasticsearch to index logs from automation runs of test cases. I create an index for each run (a run can have from 1,000 to a million events), which comes to about 200 indices per day. Is creating an index per run a good approach, or should I have just 1 index and put the logs from multiple runs into it?
The amount of data is huge, which is why I chose separate indices. I am expecting 200 logs every day, each with 1 million events. Please help.
Depends how long you want to retain your data and the size of your cluster. At 200 indices per day, each with lots of associated files, you're looking at a lot of file handles. So, that doesn't sound like it would scale beyond a few weeks or months on a very small cluster since you'll be running out of file handles.
A better strategy might be to do what logstash does by default which is to create a new index every day. Then your next choice will be to play with the number of shards and nodes in the cluster. Assuming you want to store a worst case of 200M log entries per day on a 3 or 5 node cluster, probably the default of 5 shards is fine. If you go for more nodes, you'll probably want more shards so that each shard is smaller. Also consider using elasticsearch curator to e.g. close older indices and optimize them.
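As an illustration of the daily-index approach, the Python sketch below names indices by day in the style of logstash's dated indices and picks out the ones old enough to be closed. The prefix "testlogs" and the 30-day retention are assumptions chosen for the example. In practice a tool like curator does the selection for you.

```python
from datetime import date, datetime, timedelta


def daily_index_name(day, prefix="testlogs"):
    """Index name for one day, e.g. testlogs-2024.01.15."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_older_than(names, today, keep_days, prefix="testlogs"):
    """Return the daily indices whose date is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in names:
        stamp = name[len(prefix) + 1:]  # strip "<prefix>-" to get the date part
        day = datetime.strptime(stamp, "%Y.%m.%d").date()
        if day < cutoff:
            old.append(name)
    return old


today = date(2024, 3, 1)
names = [daily_index_name(today - timedelta(days=n)) for n in (0, 10, 45)]
print(names)
print(indices_older_than(names, today, keep_days=30))
```

With one index per day, every run's events would go into that day's index, so each event needs a field identifying its run (a hypothetical run_id, say) to keep runs separable. At 5 shards per index this means 5 new primary shards per day instead of 1,000.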

Resources