I was under the impression that MinIO is well suited for small-file storage and reads (https://blog.min.io/minio-optimizes-small-objects/). I finally migrated my 2 million small text files, but the read speed is surprisingly slower than reading directly from the disk... Is there a way to compact/merge those small files? Or is there something I am doing wrong?
My usual use case: reading 10,000 random files.
Directly from the disk, I average around 120 seconds.
I transferred them to a local network solution: it took around 500-600 seconds to read.
Now with MinIO it's around 600 seconds.
Note: the disk is capable of much higher throughput, but only with large files; likewise, MinIO works great with large files.
Do you guys have any idea? I am really stuck :(
MinIO was never a good system for that kind of problem, I think; you just need to get faster hardware (a faster drive). An SSD should work fine for you.
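That said, reading 10,000 small objects one after another is mostly latency-bound, so overlapping the requests usually helps regardless of the backend. A minimal sketch using the Python minio client (the endpoint, credentials, bucket and object names are placeholders, not taken from the question):

```python
from concurrent.futures import ThreadPoolExecutor
from minio import Minio

# Placeholder endpoint/credentials/bucket - adjust to the actual deployment.
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

def fetch(name):
    """Download one small object and return its contents."""
    resp = client.get_object("textfiles", name)
    try:
        return resp.read()
    finally:
        resp.close()
        resp.release_conn()

# Hypothetical object names; substitute the real 10,000 keys.
object_names = [f"doc-{i}.txt" for i in range(10_000)]

# Overlap the GETs so per-request latency is hidden instead of paid 10,000 times.
with ThreadPoolExecutor(max_workers=32) as pool:
    contents = list(pool.map(fetch, object_names))
```

Whether this gets close to raw-disk speed depends on the deployment, but it is a cheaper experiment than new hardware.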
We have a weekly process that archives a large number of frequently changing files into a single tar file and synchronizes it to another host using rsync as follows (resulting in a very low speedup metric, usually close to 1.00):
rsync -avr <src> <dst>
Over the years, this archive has steadily grown and is now over 200 GB. With the increasing file size, rsync has come to a point where it takes about 20 hours to finish the synchronization. However, deleting the file at the destination before the rsync process starts causes the transfer to complete in only about 1 hour.
I understand that rsync's delta-transfer algorithm introduces some overhead, but the cost seems to grow much faster than linearly with very large files. If the actual transfer of bytes over the network takes 1 hour, what exactly is rsync doing in the remaining 19 hours?
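For context on where the time goes: with the delta-transfer algorithm the receiver checksums every block of its existing 200 GB copy, and the sender then slides a window over the new file, computing a weak checksum at essentially every byte offset while it searches for blocks the receiver already has. Both sides therefore do work proportional to the full file size even when little has changed. A much-simplified toy sketch of that matching loop (block size and checksum choices are for brevity, not what rsync actually uses):

```python
import hashlib
import zlib

BLOCK = 4096  # toy block size; rsync derives its own from the file size

def block_checksums(old_data):
    """Receiver side: weak + strong checksum for every block of the existing file."""
    table = {}
    for off in range(0, len(old_data), BLOCK):
        blk = old_data[off:off + BLOCK]
        table.setdefault(zlib.adler32(blk), []).append(hashlib.md5(blk).digest())
    return table

def count_matches(new_data, table):
    """Sender side: slide over the new file looking for blocks the receiver already has."""
    i = matches = 0
    while i + BLOCK <= len(new_data):
        window = new_data[i:i + BLOCK]
        weak = zlib.adler32(window)  # real rsync updates this rolling checksum in O(1)
        if weak in table and hashlib.md5(window).digest() in table[weak]:
            matches += 1
            i += BLOCK               # matched block: reference it and jump a whole block
        else:
            i += 1                   # miss: send one literal byte and slide the window
    return matches
```

Even in this toy form, every byte of both the old and the new copy gets hashed at least once, which is one reason the runtime tracks the archive size rather than the amount of changed data.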
Situation:
To replace a 10+ year old Windows 2000 2-node cluster with shared MSA SCSI storage with a newer Windows 2003 2-node cluster with shared FC storage.
The shared storage is currently split into two drives: X (data) and Q (quorum).
The X Drive holds a flat-file DB consisting of 13.1 million+ files in 1.3 million+ folders. These files need to be copied from the old cluster to the new cluster with minimal downtime.
File Count: 13,023,328
Total Filesize: 8.43 GB (file size, not size on disk)
Folder Count: 1,308,153
The old Win 2000 cluster has been up for over 10 years, continually reading/writing, and is now also heavily fragmented. The X Drive on the Win 2000 cluster also contains 7 backups of the DB, which are created/updated via Robocopy once per day; this currently takes 4-5 hours and adds a real lag to system performance.
Old Cluster
- 2 x HP DL380 G4 | 1 x HP MSA 500 G2 (SCSI) | RAID 5 (4 disks + spare) | Win 2k
New Cluster
- 2 x HP DL380 G7 | 1 x HP StorageWorks P2000 G2 MSA (Fibre Channel) | Win 2k3
The database can be offline for 5 to 8 hours comfortably, and 15 hours absolute maximum, due to the time-sensitive data it provides.
Options We've Tried:
Robocopy / FastCopy both seemed to sit at around 100-300 files copied per second, with the database offline.
PeerSync copy from a local node backup (D: drive): this completed in 17 hours with an average of 250 files per second.
Question/Options:
Block-by-Block Copy - We think this might be the fastest, but it will also copy the backups from the original X drive.
Redirect Daily Backup - Redirect the daily backup from the local X Drive to a network share on the new X Drive. Slow to begin with, but it would then only be up to 12 hours out of date when we come to switch over, as it could run while the old system is live. The final sync on the move day should take no more than 10 hours, to 100% confirm the old and new systems are identical.
Custom Copy Script - We have access to C# and Python; a rough sketch of this approach follows at the end of this post.
Robocopy / FastCopy / other file copy tools - open to suggestions and settings.
Disk Replace / RAID Rebuild - The risky or impossible option: replace each of the older disks with a new smaller-form-factor disk in an old G2 caddy, let the RAID rebuild, and repeat until all drives are replaced. On the day of migration, move the 4 disks to the new P2000 MSA in the same RAID order?
Give Up - And leave it running on the old hardware until it dies a fiery death.*
We seem to be gravitating to Option 2, but thought we should put this to some of the best minds in the world before committing.
P.S. Backups on the new cluster go to a new (M:) drive using Shadow Copy.
* Unfortunately not a real option, as we do need to move to the newer hardware; the old storage and cluster can no longer cope with demand.
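For what it's worth, here is the rough, hypothetical Python sketch mentioned under the custom-script option: the usual approach with this many tiny files is to hand out one directory per worker so many small copies are in flight at once (the source root and destination share below are placeholders):

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SRC = Path(r"X:\flatfiledb")                 # placeholder source root
DST = Path(r"\\newcluster\X$\flatfiledb")    # placeholder destination share

def copy_dir(src_dir: Path) -> None:
    """Copy the plain files of one directory, preserving timestamps."""
    out = DST / src_dir.relative_to(SRC)
    out.mkdir(parents=True, exist_ok=True)
    for f in src_dir.iterdir():
        if f.is_file():
            shutil.copy2(f, out / f.name)

# One task per directory gives ~1.3 million small work units that can overlap.
dirs = [SRC] + [d for d in SRC.rglob("*") if d.is_dir()]
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(copy_dir, dirs))  # list() so worker exceptions surface
```

Whether this beats Robocopy or PeerSync depends mostly on how hard the source RAID set can be pushed, so treat it as an experiment rather than a plan.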
We went with Option 2, and redirected the twice-daily backup from the original cluster to the new MSA RAID on the new cluster.
It was run as a pull from the new cluster using PeerSync and a Windows share on the old cluster.
We tried to use the PeerSync TCP client which would have been faster / more efficient, but it wasn't compatible with Windows 2000. PeerSync was chosen over most other copy tools out there due to its compatibility and non-locking file operations, allowing the original cluster to be online throughout with minimal performance impact.
This took around 13.5 hours for the initial copy, and then around 5.5 hours for the incremental diff copies. The major limiting factor was the original cluster's shared MSA RAID set; the drives were online and being accessed during the backups, so normal operation slowed the backup runs down.
The final sync took about 5 hours, and that was the total time the database was offline for the hardware upgrade.
Is it true that large directories can cause increased I/O wait?
I was told to put no more than 1,000 images per directory.
Thanks.
Yes!!!
You can divide the files into subdirectories under the main directory; it makes it faster for the I/O layer to find files.
The maximum number of files may differ with the file system you are using; you can find it here.
You can divide your file/folder structure as needed; 1,000 to 3,000 files per directory is a good number if you are going to have a lot of files.
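One common way to do that split automatically is to derive the subdirectory from a hash of the filename, so the files spread evenly and no single directory grows too large. A small sketch (the two-level layout and the images root are just an example):

```python
import hashlib
from pathlib import Path

ROOT = Path("images")  # example storage root

def path_for(filename: str) -> Path:
    """Map a filename to a stable two-level subdirectory, e.g. images/3f/a2/cat.jpg."""
    h = hashlib.md5(filename.encode()).hexdigest()
    return ROOT / h[:2] / h[2:4] / filename

# 256 * 256 = 65,536 buckets, so even millions of images stay
# well below a few thousand files per directory.
target = path_for("cat.jpg")
target.parent.mkdir(parents=True, exist_ok=True)
```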
I am working on a cluster where I submit jobs through the qsub engine.
I am granted a maximum of 72 hours of computational time at once. The output of my simulation is a folder which typically contains about 1,000 files (about 10 GB). I copy my output back after 71 h 30 min of simulation, which means that everything produced after 71 h 30 min (plus the time to copy?) is lost. Is there a way to make the process more efficient, i.e. not having to manually estimate the time needed to copy the output back?
Also, before copying my output back I compress the files with bzip2; what resources are used to do that? Should I ask for one node more than what I need to run the simulation, just to compress the files?
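Scheduler details vary, but one way to avoid estimating the copy time by hand is to let a small wrapper enforce the budget inside the job itself: run the solver with a timeout that leaves a fixed reserve, then always archive and copy whatever exists. A rough sketch in Python (the executable name, output folder, destination and reserve are placeholders; the bzip2 compression runs on the CPUs of the node the job already has):

```python
import shutil
import subprocess
import tarfile
import time

WALLTIME = 72 * 3600           # wall time granted by the scheduler, in seconds
RESERVE  = 30 * 60             # kept back for compressing + copying; tune to your output size
OUTDIR   = "output"            # placeholder: folder the simulation writes into
DEST     = "/home/me/results"  # placeholder: where the archive should end up

start = time.time()
try:
    # Stop the solver once only the reserve is left, so the copy step always runs.
    subprocess.run(["./simulation"], timeout=WALLTIME - RESERVE)  # placeholder executable
except subprocess.TimeoutExpired:
    pass  # whatever was produced up to this point is still archived below

# The bzip2 compression happens here, on this node, inside the same job.
archive = "results.tar.bz2"
with tarfile.open(archive, "w:bz2") as tar:
    tar.add(OUTDIR)

shutil.copy(archive, DEST)
print(f"archived and copied after {time.time() - start:.0f} s")
```

In this layout the compression does not need an extra node; it only needs to fit inside the reserve you set.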