Rsync checksum only for same size files - time

There are a bunch of threads regarding rsync checksums, but none seems to address this need, which would be the most effective and fastest way to sync, at least in my case:
same time and same size ► skip file (no transfer, no checksum)
different sizes ► transfer file (no checksum)
different times and same size ► perform checksum ► transfer only if checksums differ
I noticed that the option --checksum can really take a long time to mirror a folder if there are a lot of files. Using this option alone will run a checksum on every single file, which is very safe but very slow. Besides, it induces read-access overhead to compute the checksums.
The option --ignore-times is not what I want: if time and size both match, the chance that the files are different is insignificant, and I'm willing to take the risk of not transferring them.
The option --size-only is incomplete, as there is a good chance that files having the same size but different times are actually different (e.g. changing one character to another may not affect the size, just the modification time).
Is there a way to perform the mirroring as per the combination above, with rsync (did I miss something in the manpages) or with any other Linux tools?
Thanks.

When determining whether to transfer files (or with --dry-run, whether to list files), rsync will always transfer files that differ in filesize. However, when files are the same size, rsync has several options:
with --size-only: never transfer files
with --ignore-times: always transfer files
default: if timestamps differ, transfer files
with --checksum: calculate checksums and transfer files if they differ
The behavior that you want would be a combination of the last two:
"if timestamps differ, calculate checksums and transfer files if the checksums differ as well".
This is not currently an option in rsync.
Unfortunately, looking at the rsync source code, it appears it would be non-trivial to add this functionality. Currently, if checksums are used, the remote rsync gathers size, timestamp and checksum information and sends them all together. The desired behavior would require that the remote rsync first send over the size and timestamp, and when the local rsync determines that a checksum is needed, return to the file to get the checksum. But the whole "remote rsync returns to the file" aspect is not present in the current code, and would first need to be written.
When you run an actual transfer, the second step can effectively be done during the transfer process: transferring files that do not differ is very efficient. So the default behaviour of rsync would suffice. When using --dry-run, the best approach would probably be to run rsync with the default behaviour first, gather the --dry-run output, and then run rsync again, with --checksum, on the files found in the first run.
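A rough Python sketch of that two-pass --dry-run idea, assuming GNU rsync is on the PATH; the --out-format string and the temporary candidates file are my own choices and may need adjusting for your paths:
import subprocess

def two_pass_dry_run(src, dst):
    # Pass 1: default quick check (size + mtime), dry run, print candidate names.
    first = subprocess.run(
        ["rsync", "-a", "--dry-run", "--out-format=%n", src, dst],
        capture_output=True, text=True, check=True).stdout
    candidates = [line for line in first.splitlines() if line and not line.endswith("/")]
    if not candidates:
        return []
    # Pass 2: re-check only those candidates, this time with checksums.
    with open("candidates.txt", "w") as f:
        f.write("\n".join(candidates) + "\n")
    second = subprocess.run(
        ["rsync", "-a", "--dry-run", "--checksum", "--files-from=candidates.txt",
         "--out-format=%n", src, dst],
        capture_output=True, text=True, check=True).stdout
    return [line for line in second.splitlines() if line and not line.endswith("/")]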

The short answer... it does.
same time and same size ► skip file (no transfer, no checksum)
Good and fast, but not exact; rsync offers that by default. The file could be modified while the time and size stay the same (times can be reset). You can use -c if you're paranoid.
different sizes ► transfer file (no checksum)
Simplistic... what if it's a 2 gig file... and the only difference is 1 line at the end? The checksum can figure that out and spare the network traffic. You can use -c if you trust the time/size comparison.
different times and same size ► perform checksum ► transfer only if checksums differ
Of course.
I don't see it, but I remember rsync used to have an issue if there were over ... I think it was around 130,000 files. Maybe that issue was fixed.
If you do have that many files in one directory you probably have bigger problems... spread them out over different directories and do multiple rsyncs on those multiple directories.
Lots of small files (on most filesystems) have a lot of internal fragmentation issues and you might be better off archiving the files and rsyncing the archive... you need an archiver that allows updating the archive rather than re-creating it all the time.
Maybe, if not many of these files are updated, find the ones changed after a date (find -newer file) and then rsync just those files (if you trust the times). A sketch of that idea follows.
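A hypothetical Python sketch (the list file name and destination are placeholders): collect files whose mtime is newer than a reference file's, then hand only those to rsync via --files-from.
import os
import subprocess

def sync_newer_than(src_dir, dst, reference):
    cutoff = os.stat(reference).st_mtime
    changed = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_mtime > cutoff:
                changed.append(os.path.relpath(path, src_dir))
    with open("changed.txt", "w") as f:
        f.write("\n".join(changed) + "\n")
    # Only the listed files are considered; paths are relative to src_dir.
    subprocess.run(["rsync", "-a", "--files-from=changed.txt", src_dir, dst],
                   check=True)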
Why was this question ignored so long?

Related

Break and combine torrent file

Is there a way I can break a big torrent file (say, more than 4 GB) into small parts (say, less than 1 GB each) and download them separately?
I should later be able to combine them and the final result should be the same as if I had downloaded the whole torrent file in one go.
Yes.
A torrent file contains modular information about the file it's tracking as 'pieces', which are slices of the entire file you're downloading; pieces are generally between 16 KB and 4 MB.
So, in essence, a torrent is already downloaded as n parts and combined later. All you'd need is a client that lets you specify that you only want the first, second, third or fourth quarter of the pieces, which is where you'll run into some trouble: I don't know of any client that supports such a thing.
In addition to this answer, it's possible to split a file into smaller chunks with a file archiver or similar software.
You can then send those files with torrents.
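A simple Python sketch of that splitting idea (the .partNNN naming is my own convention): cut the file into fixed-size chunks, and concatenate them again on the other side.
import shutil

def split_file(path, chunk_size=1024**3):          # 1 GiB per chunk
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)            # note: one whole chunk in memory at a time
            if not data:
                break
            part_name = "%s.part%03d" % (path, index)
            with open(part_name, "wb") as part:
                part.write(data)
            parts.append(part_name)
            index += 1
    return parts

def join_file(path, parts):
    with open(path, "wb") as dst:
        for part_name in sorted(parts):
            with open(part_name, "rb") as src:
                shutil.copyfileobj(src, dst)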

The max number of files in one directory?

Is there any limitation on the number of files in one directory (on any host)?
If I have a directory with 30k files (named from 1 to 30k) and another one with only 10, is there a major difference in performance when fetching a specific file?
thanx
It depends on your file system type. The answer to this question can be found in your current file system's specification.
Arch Linux Wiki performance optimization page
Summary:
XFS: Excellent performance with large files. Low speed with small files. A good choice for /home.
Reiserfs: Excellent performance with small files. A good choice for /var.
Ext3: Average performance, reliable.
Ext4: Great overall performance, reliable, has performance issues with sqlite and some other databases.
JFS: Good overall performance, very low CPU usage, extremely fast resume after power failure.
Btrfs: Probably best overall performance (with compression) and lots of features. Still in heavy development and fully supported, but considered unstable. Do not use this filesystem yet unless you know what you are doing and are prepared for potential data loss.
fsck time vs Inode Count
I'd say the max number of files is OS-specific and file-system-specific. But having a huge number of files in one directory can drastically hit your performance when accessing a file.
I cannot give you any numbers for any specific OS/FS, but here is a possible solution if you have performance issues:
In the MediaWiki software (that's the software Wikipedia runs on) they use subdirectories to counter that problem. This is how they store media files:
md5-hash the name of the file
take the first digit of the md5 hash as the subdirectory of the files dir
take the first 2 digits of the md5 hash as the name of a sub-subdirectory of that subdirectory
store the file there
This way they can find a file by its name only, but don't need to rely on a good OS/FS for zillions of files. It results in something like this:
http://upload.wikimedia.org/wikipedia/commons/7/74/Flag_of_Hamburg.svg is the path for Flag_of_Hamburg.svg
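A small Python sketch of that scheme (the base directory name is just an example):
import hashlib
import os

def media_path(base_dir, file_name):
    # Hash the file name with MD5 ...
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    # ... then use the first hex digit and the first two hex digits
    # as the two directory levels.
    return os.path.join(base_dir, digest[0], digest[:2], file_name)

print(media_path("commons", "Flag_of_Hamburg.svg"))
# should print commons/7/74/Flag_of_Hamburg.svg, matching the URL above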

Performance when accessing file system metadata concurrently

I have to collect certain attributes of files (modification date and so on). But there are many small files to analyze.
My question is: would it be more performant if I read, say, 3 or 4 files at the same time? If you access a file on the web this is better, since you have to wait for the server to respond. But what about a hard disk? Is the concurrent strategy faster if the files are already cached by the hard disk?
You are accessing metadata, it seems (mtime), which is stored in the file's inode and therefore in the file system. Your limiting factor should (in UNIX terms) be the syscall to get the stat information, which could profit from parallelization.
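A minimal Python sketch of parallelizing those stat() calls with a small thread pool (the worker count is an arbitrary choice):
import os
from concurrent.futures import ThreadPoolExecutor

def mtime(path):
    # os.stat() is the syscall mentioned above; st_mtime is the modification time.
    return path, os.stat(path).st_mtime

def collect_mtimes(paths, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(mtime, paths))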

Optimizing file reading from HD

I have the following loop:
for fileName in fileList:
    f = open(fileName)
    txt = f.read()
    f.close()
    analyze(txt)
The fileList is a list of more than 1 million small files. Empirically, I have found that the call to open(fileName) takes more than 90% of the loop's running time. What would you do in order to optimize this loop? This is a "software only" question; buying new hardware is not an option.
Some information about this file collection:
Each file name is a 9-13 digit ID. The files are arranged in subfolders according to the first 4 digits of the ID. The files are stored on an NTFS disk and I'd rather not change the disk format for reasons I won't get into, unless someone here has a strong belief that such a change will make a huge difference.
Solution
Thank you all for the answers.
My solution was to pass over all the files, parsing them and putting the results in an SQLite database. Now the analyses that I perform on the data (select several entries, do the math) take only seconds. As already said, the reading part took about 90% of the time, so parsing the XML files in advance had little effect on the performance, compared to the effect of not having to read the actual files from the disk.
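A rough sketch of that approach (the table layout and the parse_fields() helper are hypothetical, standing in for the asker's XML parsing): parse every file once, cache the extracted values in SQLite, and run later analyses as SQL queries instead of reopening a million files.
import sqlite3

def build_cache(fileList, parse_fields, db_path="cache.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, value REAL)")
    for fileName in fileList:
        with open(fileName) as f:
            txt = f.read()
        con.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)",
                    (fileName, parse_fields(txt)))
    con.commit()
    return con
    # later analyses become queries, e.g. SELECT avg(value) FROM docs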
Hardware solution
You should really benefit from using a solid state drive (SSD). These are a lot faster than traditional hard disk drives, because they don't have any hardware components that need to spin and move around.
Software solution
Are these files under your control, or are they coming from an external system? If you're in control, I'd suggest you use a database to store the information.
If a database is too much of a hassle for you, try to store the information in a single file and read from that. If that file isn't fragmented too much, you'll have much better performance compared to having millions of small files.
If opening and closing files is taking most of your time, a good idea would be to use a database or data store rather than a collection of flat files.
To address your final point:
unless someone here has a strong belief that such a change will make a huge difference
If we're really talking about 1 million small files, merging them into one large file (or a small number of files) will almost certainly make a huge difference. Try it as an experiment.
Store the files in a single .zip archive and read them from that. You are just reading these files, right?
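A minimal Python sketch, assuming the collection has been packed into a single files.zip (the name is mine) and reusing the question's analyze():
import zipfile

with zipfile.ZipFile("files.zip") as zf:
    for name in zf.namelist():
        txt = zf.read(name).decode("utf-8")   # one archive open, many member reads
        analyze(txt)                          # analyze() as in the question's loop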
So, let's get this straight: you have sound empirical data that shows that your bottleneck is the filesystem, but you don't want to change your file structure? Look up Amdahl's law. If opening the files takes 90% of the time, then without changing that part of the program, you will not be able to speed things up by more than 10%.
Take a look at the properties dialog box for the directory containing all those files. I'd imagine the "size on disk" value is much larger than the total size of the files, because of the overhead of the filesystem (things like per-file metadata that is probably very redundant, and files being stored with an integer number of 4k blocks).
Since what you have here is essentially a large hash table, you should store it in a file format that is more suited to that kind of usage. Depending on whether you will need to modify these files and whether the data set will fit in RAM, you should look into using a full-fledged database, a lightweight embeddable one like sqlite, your language's hash table/dictionary serialization format, a tar archive, or a key-value store program that has good persistence support.

Performance of one huge unix directory VS a directory tree?

My PHP project will use thousands of pictures and each needs only a single number for its storage name.
My initial idea was to put all of the pictures in a single directory and name the files "0.jpg", "1.jpg", "2.jpg", and all the way to "4294967295.jpg" .
Would it be better performance-wise to create a directory tree structure and name the files something like "429 / 496 / 7295.jpg"?
If the answer is yes, then the follow up question would be: what is the optimal amount of subdirs or files per level of depth? And what effect does the chosen filesystem have on this?
Each file will have a corresponding MySQL entry with an UNSIGNED LONGINT id-number.
Thank you.
Yes, hard-to-say, quite a bit, perhaps you should use a database
The conventional wisdom is "use a database", but using the filesystem is a reasonable plan for larger objects like images.
Some filesystems have limits on the number of directory entries. Some filesystems do not have any sort of data structure for filename lookups, but just do a linear scan of the directory.
Optimizations like you are discussing are restricted to specific environmental profiles. Do you even know right now what future hardware your application will run on? Might it be a good idea to not stress the filesystem and make a nice, hierarchical directory structure? If you do that it will run well on any filesystem or storage server.
It depends on which filesystem is being used. ext{2,3,4} have a dir_index option that can be set when they are created that make storing thousands or even millions of files in a single directory reasonably fast.
btrfs is not yet production ready, but it implicitly supports this idea at a very basic level.
But if you're using the ext series without dir_index or most other Unix filesystems you will need to go for the more complex scheme of having several levels of directories. I would suggest you avoid that if you can. It just adds a lot of extra complication for something filesystems ought to be handling reasonably for you.
If you do use the more complex scheme, I would suggest encoding the number in hex and having 256 files/directories at each level. Filesystems that aren't designed to handle large numbers of files in each directory typically do linear scans. The goal is to approximate a B-Tree type structure on your own. 2 hex digits at each level gives you about half a 4kiB (a common size) disk block per level with common means of encoding directories. That's about as good as you're going to get without a really complicated scheme like encoding your numbers in base 23 or base 24.
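A short Python sketch of that hex scheme (the .jpg suffix and base directory follow the question; two levels of 2 hex digits each is the suggestion above):
import os

def picture_path(base_dir, picture_id, levels=2):
    h = format(picture_id, "08x")                 # 32-bit id -> 8 hex digits
    shards = [h[2*i:2*i + 2] for i in range(levels)]
    return os.path.join(base_dir, *shards, h + ".jpg")

print(picture_path("pictures", 4294967295))
# -> pictures/ff/ff/ffffffff.jpg, with at most 256 entries per directory level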
Having several thousand files in one directory will slow things down considerably. I'd say a safe number is up to 1024 files per directory, 512 even better.
The answer, of course, is: It depends.
In particular, it depends on which file system you use. For example, the ext2 and ext3 file systems have limits on the number of files per directory. Those file systems would not be able to put all of your pictures in one directory!
You might look into something other than a file system. In the company I work for, because we needed to store lots of material, we moved from file-based storage to a database-based storage run on Apache Jackrabbit.
