How does MinIO deal with sharding?

I have been using MinIO for a while now, but always with files smaller than 1 GB.
I have a MinIO setup with 4 VMs, each with 2 disks of 100 GB. I would like to know whether there is any problem if I try to copy files bigger than 400 GB, and also how sharding works in MinIO.
I know how to use and configure sharding in SDS solutions such as Gluster, but so far I have found no reference about it for MinIO.

Related

Image pull over multiple K8s nodes

When I create a pod, the corresponding image is pulled to the node where the pod is scheduled.
Can I have those images shared among the cluster nodes, instead of being stored locally on each node?
It's possible if you have shared storage across all the Kubernetes nodes. However, it's not a good idea 🙅, since the place where images get stored is typically also the place where the container runtime stores its files while actually running containers. For example, if you are using Docker, everything gets stored under /var/lib/docker; in the case of containerd it's /var/lib/containerd.
So in summary, it's possible with shared/cluster file systems like NFS, Ceph, GlusterFS, AWS EFS, etc., but it's not a good idea in my opinion 🚫.
Update (#BMitch):
Make sure that the container storage driver you are using supports the filesystem that you are using.
✌️

How to remove duplicate files using Apache Nifi?

I have a couple of EC2 servers set up, with the same EFS mounted on each of these instances.
I have also set up Apache NiFi independently on each of the 2 machines. Now, when I try to make a data flow to copy files from the EFS-mounted folder, I get duplicated files on both servers.
Is there some way in Apache NiFi to weed out the duplicates, since both instances fire at the same time? Cron scheduling is not enough, as at some point the two servers will still collide.
For detecting duplicate files you can use the DetectDuplicate processor.
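The idea behind DetectDuplicate can be sketched as content hashing against a shared cache (in NiFi the cache would be a DistributedMapCacheServer reachable from both nodes; the set below only stands in for it):

```python
import hashlib

# Sketch of the DetectDuplicate idea: hash each file's content and check
# the hash against a cache shared by both NiFi nodes. The in-process set
# here stands in for NiFi's DistributedMapCacheServer.
seen = set()

def is_duplicate(content: bytes) -> bool:
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Whichever node processes a file first wins; the second node sees the cached hash and can route the flowfile to a "duplicate" relationship instead of copying it again.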

Strategy to persist the node's data for dynamic Elasticsearch clusters

I'm sorry that this is probably a kind of broad question, but I haven't found a solution for this problem yet.
I try to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. Therefore, I built a Docker image that can start on Marathon and dynamically scale via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that if the cluster is scaled down (I know this also involves the index configuration itself) or stopped, I can later restart (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos slave) the nodes run, so from my point of view it's not predictable whether all data is available to the "new" nodes upon restart when I try to persist the data to the Docker hosts via Docker volumes.
The only things that come to mind are:
Using a distributed file system like HDFS or NFS, with mounted volumes either on the Docker host or the Docker images themselves. Still, that would leave the question how to load all data during the new cluster startup if the "old" cluster had for example 8 nodes, and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties...
Are there any other way to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource about this kind of topic. Thanks a lot in advance.
Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS; it's much too slow, and Elasticsearch performs best on fast, local storage. Introduce the network into that equation and you'll get into trouble. I have no idea about Docker or Mesos, but I certainly recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but the rest of the snapshots should take less space and less time. Also, note that "incremental" means incremental at file level, not document level.
The snapshot itself needs all the nodes that hold the primaries of the indices you want snapshotted, and those nodes all need access to the common location (the repository) so that they can write to it. This shared access to the same location is usually not that obvious, which is why I'm mentioning it.
The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort in this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know its status, but you may want to give it a try.
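The snapshot/restore flow above boils down to two HTTP PUTs against the Elasticsearch snapshot API. A minimal sketch of the payloads follows; the repository name, location, and snapshot name are hypothetical, and the actual HTTP calls are omitted:

```python
import json

# Register a shared-filesystem repository, then take a snapshot into it.
# Endpoints per the Elasticsearch snapshot API:
#   PUT _snapshot/<repo>             -> register repository
#   PUT _snapshot/<repo>/<snapshot>  -> create snapshot
repo_settings = {
    "type": "fs",  # filesystem repository; the path must be reachable by every data node
    "settings": {"location": "/mnt/es_backups", "compress": True},
}
snapshot_body = {"indices": "_all", "ignore_unavailable": True}

register_url = "http://localhost:9200/_snapshot/my_backup"
snapshot_url = register_url + "/snapshot_1?wait_for_completion=true"

# Only the request bodies are shown; send them with curl or any HTTP client.
print(json.dumps(repo_settings, indent=2))
```

The `location` is exactly the "common access" caveat from the answer: every data node holding a primary must be able to write to that path.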

Best strategy for deploying MongoDB on servers with multiple disks

I am planning to set up a small MongoDB testing environment consisting of 6 physical servers, each with 4 SSDs. My expectation is to maximize both I/O throughput and disk-space utilization. After some reading and searching, it seems that MongoDB can only be configured with a single data folder per instance. There are 3 possible solutions in my mind:
create a big volume on each server with RAID (e.g., RAID5 or RAID10).
for each disk on each server, create a mongo shard instance (or shard's replica).
start a mongo instance with the "--directoryperdb" parameter to store databases in separate folders, and then use symbolic links to point the database folders to other disks.
Which deployment strategy is the most recommended way in a heavy loaded mongo production environment? Thanks.
for each disk on each server, create a mongo shard instance (or shard's replica).
You make no mention of RAM in your post, but running multiple Mongo instances on the same server will result in them competing for RAM.
start a mongo instance with the "--directoryperdb" parameter to store databases in separate folders, and then use symbolic links to point the database folders to other disks.
No need to mess around with symbolic links, just mount the disks to the directory where Mongo puts each database.
I would either go for option 1 (RAID) or option 3 (databases on separate disks), but in either case some benchmarks would be a good idea before going into production.
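For option 3 without symlinks, as suggested above, each SSD can be mounted directly under the dbPath with directoryPerDB enabled. Device names, mount points, and database names below are illustrative, and note that changing directoryPerDB on an existing dbPath requires a dump/restore:

```
# /etc/fstab -- one SSD per database directory (illustrative devices)
/dev/sdb1  /data/db/app_db   xfs  defaults,noatime  0 0
/dev/sdc1  /data/db/logs_db  xfs  defaults,noatime  0 0

# mongod.conf excerpt
storage:
  dbPath: /data/db
  directoryPerDB: true
```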

How to rsync to all Amazon EC2 servers?

I have a Scalr EC2 cluster, and want an easy way to synchronize files across all instances.
For example, I have a bunch of files in /var/www on one instance, and I want to be able to identify all of the other hosts and then rsync to each of them to update their files.
ls /etc/aws/hosts/app/
returns the IP addresses of all of the other instances
10.1.2.3
10.1.33.2
10.166.23.1
Ideas?
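One straightforward approach, given the hosts file above, is to build one rsync command per host. A sketch (IPs and paths from the question; the rsync flags are illustrative, and the commands are only constructed here, not executed):

```python
# Build one rsync command per host listed under /etc/aws/hosts/app/.
# In practice the IPs would be read from those files; they are inlined
# here from the question for illustration.
hosts = ["10.1.2.3", "10.1.33.2", "10.166.23.1"]

commands = [
    ["rsync", "-az", "--delete", "/var/www/", f"{host}:/var/www/"]
    for host in hosts
]

for cmd in commands:
    print(" ".join(cmd))
```

Running these over SSH assumes key-based authentication is already set up between the instances.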
As Zach said you could use S3.
You could download one of the many clients out there for mapping drives to S3 (search for "S3" and "WebDAV").
If I were going to go this route, I would set up an S3 bucket with all my shared files and use jetS3t in a cron job to sync each node's local drive with the bucket (pulling down S3 bucket updates). Then, since I normally use Eclipse and Ant for building, I would create an Ant job for deploying updates to the S3 bucket (pushing updates up).
From http://jets3t.s3.amazonaws.com/applications/synchronize.html
Usage: Synchronize [options] UP <S3Path> <File/Directory>
(...)
or: Synchronize [options] DOWN
UP : Synchronize the contents of the Local Directory with S3.
DOWN : Synchronize the contents of S3 with the Local Directory
...
I would recommend the above solution if you don't need cross-node file locking. It's easy, and every system can just pull data from a central location.
If you need more cross-node locking:
An ideal solution would be IBM's GPFS, but IBM doesn't just give it away (at least not yet). Even though it's designed for high-performance interconnects, it can also be used over slower connections. We used it as a replacement for NFS and it was amazingly fast (about 3 times faster than NFS). There may be something similar that is open source, but I don't know. EDIT: OpenAFS may work well for building a clustered filesystem over many EC2 instances.
Have you evaluated using NFS? Maybe you could dedicate one instance as an NFS host.
