best strategy for deploying mongodb on servers with multiple disks - performance

I am planning to set up a small MongoDB testing environment consisting of 6 physical servers, each with 4 SSDs. My goal is to maximize both I/O throughput and disk space utilization. After some reading and searching, it seems that MongoDB can only be configured with a single data folder per instance. I have 3 possible solutions in mind:
1. Create one big volume on each server with RAID (e.g., RAID5 or RAID10).
2. For each disk on each server, create a mongo shard instance (or a shard's replica).
3. Start the mongo instance with the "--directoryperdb" parameter to store databases in separate folders, and then use Linux symbolic links to point the database folders to other disks.
Which deployment strategy is the most recommended for a heavily loaded mongo production environment? Thanks.

for each disk on each server, create a mongo shard instance (or shard's replica).
You make no mention of RAM in your post, but running multiple Mongo instances on the same server will result in them competing for RAM.
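To make the RAM contention concrete, here is a rough sketch. It assumes the WiredTiger storage engine (the question does not say which engine is in use), whose documented default internal cache is the larger of 50% of (RAM - 1 GB) and 256 MB per mongod process. The 64 GB server and the helper names are hypothetical.

```python
def default_wiredtiger_cache_gb(total_ram_gb: float) -> float:
    """Default WiredTiger cache for ONE mongod: max(50% of (RAM - 1 GB), 0.25 GB)."""
    return max(0.5 * (total_ram_gb - 1.0), 0.25)


def per_instance_cache_gb(total_ram_gb: float, instances: int) -> float:
    """Split one instance's default budget evenly across several mongod processes.

    This is only one possible policy (an assumption, not an official rule);
    0.25 GB is kept as the floor for each instance.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return max(default_wiredtiger_cache_gb(total_ram_gb) / instances, 0.25)


if __name__ == "__main__":
    ram_gb = 64.0      # hypothetical server RAM
    instances = 4      # one mongod per SSD, as in option 2

    default_each = default_wiredtiger_cache_gb(ram_gb)
    print("default cache per instance (GB):", default_each)
    print("combined if left at defaults (GB):", default_each * instances)
    print("even split per instance (GB):", per_instance_cache_gb(ram_gb, instances))
```

Left at their defaults, the instances together would try to cache far more than the machine holds. If you do run several instances per host, the per-instance value could be passed to each mongod with `--wiredTigerCacheSizeGB`.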
start mongo instance with "--directoryperdb" parameter to store databases in separate folders, and then use Linux symbolic link to point the database folders to other disks.
No need to mess around with symbolic links; just mount each disk at the directory where Mongo puts the corresponding database.
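As a small illustration of that layout: with "--directoryperdb", each database's files live in a subdirectory of the data path named after the database, so each disk can be mounted at that subdirectory. The sketch below only prints the mount commands it would suggest; the dbpath, database names, and device names are all hypothetical.

```python
from pathlib import PurePosixPath


def plan_mounts(dbpath: str, db_to_device: dict[str, str]) -> list[str]:
    """Return one `mount` command per database, sorted by database name.

    Each device is mounted at <dbpath>/<database>, which is where mongod
    keeps that database's files when started with --directoryperdb.
    """
    base = PurePosixPath(dbpath)
    return [
        f"mount {device} {base / db}"
        for db, device in sorted(db_to_device.items())
    ]


if __name__ == "__main__":
    # Hypothetical databases and block devices.
    layout = {"orders": "/dev/sdb1", "users": "/dev/sdc1", "logs": "/dev/sdd1"}
    for command in plan_mounts("/var/lib/mongodb", layout):
        print(command)
```

The mount points need to exist and be writable by the user running mongod before the instance is started.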
I would go for either option 1 (RAID) or option 3 (databases on separate disks), but in either case some benchmarks would be a good idea before going into production.
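For the disk space side of option 1, the usable capacity of the two RAID levels mentioned in the question can be worked out as below. This is a back-of-the-envelope sketch: the 1000 GB disk size is an assumption (the question does not give one), and it ignores filesystem and controller overhead.

```python
def usable_capacity_gb(level: str, disks: int, disk_size_gb: float) -> float:
    """Usable capacity of an array of equally sized disks.

    raid5  keeps one disk's worth of parity:     (n - 1) * size
    raid10 mirrors every disk in a stripe pair:  (n / 2) * size
    """
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (disks - 1) * disk_size_gb
    if level == "raid10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID10 needs an even number of disks, at least 4")
        return (disks // 2) * disk_size_gb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    disks, size_gb = 4, 1000.0  # 4 SSDs per server; the size is hypothetical
    for level in ("raid5", "raid10"):
        usable = usable_capacity_gb(level, disks, size_gb)
        print(level, "usable GB:", usable, "utilization:", usable / (disks * size_gb))
```

With 4 disks, RAID5 leaves three quarters of the raw space usable and RAID10 leaves half. Throughput differences between the two are workload dependent, which is why benchmarking remains the deciding step.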


How does MinIO deal with sharding?

I have been using MinIO for a while now, but I have always used it for files smaller than 1GB.
I have a MinIO setup with 4 VMs, each with 2 disks (100GB per disk), and I would like to know whether there is any problem if I try to copy files bigger than 400GB. I would also like to know how sharding works in MinIO.
I do know how to use and configure sharding in SDS systems such as Gluster, but so far I have found no reference on the internet about MinIO.

change persistent disk type to ssd

I have Elasticsearch running via ECK on a GKE cluster for production purposes, and in order to increase its performance I'm thinking of changing the persistent disk type to SSD. I came across solutions that call for creating a snapshot of the disk in GCE and then creating another SSD disk from the data stored in the snapshot. I'm still concerned about whether this carries a risk of data loss, and whether my Elasticsearch will be able to pick up the new disk, given that it is a StatefulSet.
Since this is a production deployment, I would advise the following:
Create a volume snapshot (doc).
Set up a secondary cluster (doc).
Modify the deployment so that it uses an SSD (doc).
Deploy to the second cluster.
Once this new deployment has been fully tested you can switch over the traffic.

Strategy to persist the node's data for dynamic Elasticsearch clusters

I'm sorry that this is probably a rather broad question, but I haven't found a solution to this problem yet.
I try to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. Therefore, I built a Docker image that can start on Marathon and dynamically scale via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that, if the cluster is either scaled down (I know this is also about the index configuration itself) or stopped, I can restart it later (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos slave) the nodes run, so from my point of view it's not predictable whether all the data will be available to the "new" nodes upon restart when I try to persist the data to the Docker hosts via Docker volumes.
The only things that come to mind are:
Using a distributed file system like HDFS or NFS, with mounted volumes either on the Docker host or the Docker images themselves. Still, that would leave the question how to load all data during the new cluster startup if the "old" cluster had for example 8 nodes, and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties...
Is there any other way to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource on this kind of topic. Thanks a lot in advance.
Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS, it's much too slow and Elasticsearch works better when the speed of the storage is better. If you introduce the network in this equation you'll get into trouble. I have no idea about Docker or Mesos. But for sure I recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but the rest of the snapshots should take less space and less time. Also, note that "incremental" means incremental at file level, not document level.
The snapshot itself needs all the nodes that hold the primaries of the indices you want snapshotted. And those nodes all need access to the common location (the repository) so that they can write to it. This common access to the same location is usually not that obvious, which is why I'm mentioning it.
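A minimal sketch of the snapshot/restore calls described above, using Elasticsearch's `_snapshot` REST endpoints with a shared-filesystem ("fs") repository. The host, repository name, snapshot name, and mount path are hypothetical, and the functions only build the HTTP requests so they can be inspected before anything is sent.

```python
import json
from urllib.request import Request

JSON_HEADERS = {"Content-Type": "application/json"}


def register_fs_repository(base_url: str, repo: str, location: str) -> Request:
    """PUT /_snapshot/<repo>: register a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return Request(f"{base_url}/_snapshot/{repo}",
                   data=json.dumps(body).encode("utf-8"),
                   headers=JSON_HEADERS, method="PUT")


def create_snapshot(base_url: str, repo: str, snapshot: str,
                    indices: list[str]) -> Request:
    """PUT /_snapshot/<repo>/<snapshot>: snapshot the listed indices."""
    body = {"indices": ",".join(indices)}
    url = f"{base_url}/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return Request(url, data=json.dumps(body).encode("utf-8"),
                   headers=JSON_HEADERS, method="PUT")


def restore_snapshot(base_url: str, repo: str, snapshot: str) -> Request:
    """POST /_snapshot/<repo>/<snapshot>/_restore: restore into the cluster."""
    return Request(f"{base_url}/_snapshot/{repo}/{snapshot}/_restore",
                   data=b"{}", headers=JSON_HEADERS, method="POST")


if __name__ == "__main__":
    es = "http://localhost:9200"  # hypothetical node address
    steps = [
        register_fs_repository(es, "shared_backup", "/mnt/es_backup"),
        create_snapshot(es, "shared_backup", "snap_1", ["logs", "products"]),
        restore_snapshot(es, "shared_backup", "snap_1"),
    ]
    for request in steps:
        print(request.get_method(), request.full_url)
        # To actually send it: urllib.request.urlopen(request)
```

For an "fs" repository, the location has to be the same shared mount on every node and must be listed under `path.repo` in each node's configuration, which is the common-access requirement mentioned above.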
The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort in this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project, which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know what its status is, but you may want to give it a try.

Running multiple elasticsearch instances

I need to set up 2 Elasticsearch instances:
one for kibana logs (my separate application will throw logs at it)
one for search for my production application
My plan is to create separate folders, each with its own Elasticsearch installation. They don't talk to each other, which means they are separate databases, and if one goes down, the other still runs. Is this a good solution, or should I use only one elasticsearch folder with multiple elasticsearch.yaml configuration files? What is the best practice for multiple Elasticsearch instances?
The best practice is to NOT run two Elasticsearch instances on the SAME server.
Your production search will probably need a lot of RAM to work fast and stay responsive. You don't want your logging system to interfere with that.

EBS for storing databases vs. website files

I spent the day experimenting with AWS for the first time. I've got an EC2 instance running and I mounted an Elastic Block Store (EBS) to keep the MySQL databases.
Does it make sense to also put my web application files on the EBS, or should I just deploy them to the normal EC2 file system?
When you say your web application files, I'm not sure what exactly you are referring to.
If you are referring to your deployed code, it probably doesn't make sense to use EBS. What you want to do is create an AMI with your prerequisites, then have a script to create an instance of that AMI and deploy your latest code. I highly recommend you automate and test this process as it's easy to forget about some setting you have to manually change somewhere.
If you are storing data files that are modified by the running application, EBS may make sense. If this is something like user-uploaded images or similar, you will likely find that S3 gives you a much simpler model.
EBS would be good for: databases, lucene indexes, file based CMS, SVN repository, or anything similar to that.
EBS gives you persistent storage, so if your EC2 instance fails the files still exist. Apparently there is increased IO performance, but I would test it to be sure.
If your files are going to change frequently (like a DB does) and you don't want to keep syncing them to S3 (or somewhere else), then an EBS is a good way to go. If you make infrequent changes and you can sync the files as necessary (manually or with a script), then store them in S3. If you need to shut down or you lose your instance for whatever reason, you can just pull them down when you start up the new instance.
This is also assuming that you care about cost. If cost is not an issue, using the EBS is less complicated.
I'm not sure if you plan on having a separate EBS for your DB and your web files but if you only plan on having one EBS and you have enough empty space on it for your web files, then again, the EBS is less complicated.
If it's performance you are worried about, as mentioned, it's best to test your particular app.
Our approach is to have a script pre-deployed on our AMI that fetches the latest and greatest version of the code from source control. That makes it very straightforward to launch new instances quickly, or update all running instances (we take them out of the load balancing rotation one at a time, run the script, and put them back in the rotation).
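A script of that kind might look roughly like the sketch below. It assumes the code lives in a Git repository (the answer only says "source control"); the repository URL and checkout directory are hypothetical.

```python
import os
import subprocess


def update_command(repo_url: str, checkout_dir: str, already_cloned: bool) -> list[str]:
    """Build the git command that brings checkout_dir up to date.

    A fresh instance clones the repository; an existing checkout
    fast-forwards to the latest commit on its current branch.
    """
    if already_cloned:
        return ["git", "-C", checkout_dir, "pull", "--ff-only"]
    return ["git", "clone", repo_url, checkout_dir]


def deploy_latest(repo_url: str, checkout_dir: str) -> None:
    """Fetch the latest code; raises CalledProcessError if git fails."""
    already_cloned = os.path.isdir(os.path.join(checkout_dir, ".git"))
    subprocess.run(update_command(repo_url, checkout_dir, already_cloned), check=True)


if __name__ == "__main__":
    # Hypothetical repository and target directory baked into the AMI.
    deploy_latest("https://example.com/acme/webapp.git", "/srv/webapp")
```

Running the same script at boot and during a rolling update keeps new and existing instances on the same code path.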
UPDATE:
Reading between the lines it looks like you're mounting a separate EBS volume to an instance-store backed instance. AWS recently introduced EBS backed instances that have a ton of benefits vs. the old instance-store ones. I still mount my MySQL data on a separate EBS partition, though, so that I can easily mount it to a different server if needed.
I strongly suggest an EBS backed instance with a separate EBS volume for the MySQL data.
