Fast list of top level directories in a tar - performance

I have a tar file that is over 100 GB in size, with thousands of files in it, and I'm trying to find a fast way to list only its top level.
I've seen people use --exclude and grep, but that still has to go through all of the file names first, which in my case takes a very long time.
Is there a way to do this?

Related

Recursive copy to a flat directory

I have a directory of images, currently at ~117k files for about 200 gig in size. My backup solution vomits on directories of that size, so I wish to split them into subdirectories of 1000. Name sorting or type discrimination is not required. I just want my backups to not go nuts.
From another answer, someone provided a way to move files into the split up configuration. However, that was a move, not a copy. Since this is a backup, I need a copy.
I have three thoughts:
1. Files are added to the large directory with random filenames, so alpha sorts aren't a practical way to figure out deltas. Even using a tool like rsync, adding a couple hundred files at the beginning of the list could cause a significant reshuffle and lots of file movement on the backup side.
2. The solution to this problem is to reverse the process: do an initial file split, add new files to the backup at the newest directory, manually create a new subdir at the 1000-file mark, and then use rsync to pull files from the backup directories to the work area, e.g. rsync -trvh <backupdir>/<subdir>/ <masterdir>.
3. While some answers to similar questions indicate that rsync is a poor choice for this, I may need to do multiple passes, one of which would be via a slower link to an offsite location. The performance hit of rsync and its startup parsing is far preferable to the time it would take to re-upload the backup every day.
My question is:
How do I create a script that will recurse into all 117+ subdirectories and dump the contained files into my large working directory, without a lot of unnecessary copying?
My initial research produces something like this:
#!/bin/bash
cd /path/to/backup/tree/root
find . -type d -exec rsync -trvh * /path/to/work/dir/
Am I on the right track here?
It's safe to assume modern versions of bash, find, and rsync.
Thanks!
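For illustration, here is a minimal sketch of the flatten step using find and cp rather than rsync. The function name flatten_copy and both paths are hypothetical. It assumes file names are unique across the subdirectories (a later file with the same name would overwrite an earlier one) and that cp -p preserves modification times closely enough for the "newer than" test to skip files that were already copied.

```shell
# flatten_copy SRC_TREE DEST_DIR (hypothetical helper)
# Copies every regular file found under SRC_TREE into the single flat
# directory DEST_DIR, skipping files whose copy is already up to date.
flatten_copy() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # "-exec ... {} +" passes the files to the inline script in batches.
    find "$src" -type f -exec sh -c '
        dest=$1
        shift
        for f in "$@"; do
            target=$dest/${f##*/}
            # Copy only when the target is missing or older than the source.
            if [ ! -e "$target" ] || [ "$f" -nt "$target" ]; then
                cp -p "$f" "$target"
            fi
        done
    ' sh "$dest" {} +
}
```

It would be called as flatten_copy /path/to/backup/tree/root /path/to/work/dir. Running it a second time should only copy files that are new or have changed since the first run.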

How to copy updated files in batches of limited size?

I'm on a Mac, and I want to copy, incrementally, files that have changed, to a new location. An rsync kind of arrangement. But, because of what happens to them next, I want to limit how many GB get copied in each run of the job.
If, say, I run it each night from a cron job, I want the amount of data copied per run to be capped, though the cap doesn't need to be very precise.
Ideally, it would be something like "Copy any new files in directory tree A to directory tree B, but stop after the first file that takes you over 2GB copied in this run". (I do want all files to be copied eventually, so a single large one shouldn't kill it.)
I could obviously write some code to do this, but wondered if there were any apps, obscure rsync options, ingenious uses of tar or cpio, that would save me the trouble...
(For the curious, I'm copying to my FileTransporter, which then syncs to one in my brother's house, but he has a slower connection than me and I don't want any given change to take more than a couple of hours each night.)
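The "stop after the first file that takes you over the limit" rule could be sketched in plain shell as follows. This is only an illustration: copy_with_budget is a hypothetical name, the budget is given in bytes, file names are assumed not to contain newlines, and a file counts as already copied when the destination exists and is not older than the source.

```shell
# copy_with_budget SRC DEST MAX_BYTES (hypothetical helper)
# Copies new or updated files from SRC to DEST, keeping the directory
# layout, and stops after the first file that pushes the run over MAX_BYTES.
copy_with_budget() {
    src=$1
    dest=$2
    budget=$3
    copied=0
    ( cd "$src" && find . -type f -print ) | while IFS= read -r rel; do
        rel=${rel#./}
        from=$src/$rel
        to=$dest/$rel
        # Skip files that already have an up-to-date copy.
        if [ -e "$to" ] && [ ! "$from" -nt "$to" ]; then
            continue
        fi
        mkdir -p "$(dirname "$to")"
        cp -p "$from" "$to" || continue
        size=$(wc -c < "$from")
        copied=$((copied + size))
        # The file that crosses the limit is still copied, then the run ends.
        if [ "$copied" -gt "$budget" ]; then
            break
        fi
    done
}
```

Because the file that crosses the limit is copied in full before the loop stops, a single file larger than the budget still gets through on its own run. For a 2 GB cap the third argument would be 2147483648.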

Percentage Difference Between Remote Directories

I am writing a script using rsync that synchronizes a remote directory with a local one without deleting any files. This is fairly simple, but I would like to be able to track progress, in simple terms without flooding a log file with the output of the --progress option in rsync.
To this end, I was wondering if there is a way to easily (i.e. without consuming a lot of bandwidth) calculate the difference between two directories recursively. One directory is remote and the other local. I am looking for the amount of data that exists in the remote directory but not in the local one, keeping in mind that the local directory may contain data that is not in the remote directory, and that should not be factored into the difference. Ideally, it should say something like:
Synchronizing /remote/dir -> /local/dir: 12MiB/24MiB (50%) Done
If anyone knows an easy way of accomplishing this with a bash script that would be really helpful.
As always, thanks in advance for your help.

How to programmatically find the difference between two directories

First off: I am not necessarily looking for Delphi code; spit it out any way you want.
I've been searching around (especially here) and found a bit about people looking for ways to compare two directories (including subdirectories), though they were using byte-by-byte methods. Second, I am not looking for a diff tool; I am "just" looking for a way to find files which do not match and, just as important, files which are in one directory but not the other, and vice versa.
To be more specific: I have one directory (the backup folder) which I constantly update using FindFirstChangeNotification. The first time, though, I need to copy all files, and I also need to check the backup directory against the original when the application starts (in case something happened while the application wasn't running or FindFirstChangeNotification didn't catch a file change). To solve this I am thinking of creating a CRC list for the backed-up files, then running through the original directory computing the CRC for every file, and finally comparing the two lists. Then I would somehow look for files which are in one directory and not the other (again, vice versa).
Here's the question: Is this the fastest way? If so, how would one (roughly) get the job done?
You don't necessarily need CRCs for each file, you can just compare the "last modified" date for every file for most normal purposes. It's WAY faster. If you need additional safety, you can also compare the lengths. You get both of these metrics for free with the find functions.
And in your change notification handler, you should probably add the files to a queue and use a timer object to copy the newly queued files every 30 seconds or so, so you don't bog down the system with frequent updates/checks.
For additional speed, use the Win32 functions wherever possible, avoid any Delphi find/copy/getfileinfo functions. I'm not familiar with the Delphi framework but for example the C# stuff is WAY WAY WAY slower than the Win32 functions.
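The size and modification-time check suggested above is not tied to Win32. As a rough illustration in shell (not the Delphi/Win32 code the answer has in mind), the two helpers below treat files as unchanged when their sizes match and neither is newer than the other. The names same_by_metadata and list_changed are hypothetical, and file names are assumed not to contain newlines.

```shell
# same_by_metadata FILE_A FILE_B (hypothetical helper)
# Succeeds when both files exist, have the same size, and neither is newer.
same_by_metadata() {
    [ -f "$1" ] && [ -f "$2" ] || return 1
    size_a=$(wc -c < "$1")
    size_b=$(wc -c < "$2")
    [ "$((size_a))" -eq "$((size_b))" ] || return 1
    # Equal modification times: neither file is newer than the other.
    [ ! "$1" -nt "$2" ] && [ ! "$2" -nt "$1" ]
}

# list_changed ORIGINAL BACKUP (hypothetical helper)
# Prints the relative path of every file in ORIGINAL that is missing from
# BACKUP or differs from it in size or modification time.
list_changed() {
    ( cd "$1" && find . -type f -print ) | while IFS= read -r rel; do
        same_by_metadata "$1/$rel" "$2/$rel" || printf '%s\n' "${rel#./}"
    done
}
```

Swapping the two arguments of list_changed gives the other direction, i.e. files that exist only in the backup.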
Regardless of your "not looking for a difftool", are you opposed to using Cygwin with its "diff" command for the shell? If you are open to this, it's quite easy, particularly using diff with the -r "recursive" option.
The following generates the differences between 2 Rails installs on my machine, and greps out not only information about differences between files but also, specifically by grepping for 'Only', finds files in one directory, but not the other:
$ diff -r pgnindex pgnonrails | egrep '^Only|diff'
Only in pgnindex/app/controllers: openings_controller.rb
Only in pgnindex/app/helpers: openings_helper.rb
Only in pgnindex/app/views: openings
diff -r pgnindex/config/environment.rb pgnonrails/config/environment.rb
diff -r pgnindex/config/initializers/session_store.rb pgnonrails/config/initializers/session_store.rb
diff -r pgnindex/log/development.log pgnonrails/log/development.log
Only in pgnindex/test/functional: openings_controller_test.rb
Only in pgnindex/test/unit: helpers
The fastest way to compare one directory on the local machine to a directory on another machine thousands of miles away is exactly as you propose:
1. Generate a CRC/checksum for every file.
2. Send the name, path, and CRC/checksum for each file over the internet to the other machine.
3. Compare.
Perhaps the easiest way to do that is to use rsync with the "--dry-run" or "--list-only" option (or use one of the many applications that use the rsync algorithm, or compile the rsync algorithm into your application).
cd some_backup_directory
rsync -rv --dry-run myname@remote_host:latest_version_directory/ .
For speed, the default rsync assumes, as Blindy suggested, that two files with the same name and the same path and the same length and the same modification time are the same.
For extra safety, you can give rsync the "--checksum" option so that it no longer trusts the modification time and instead compares (the checksum of) the actual contents of files whose sizes match.
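For anyone who wants to see the checksum-list idea from this answer without rsync, here is a minimal local sketch built on the POSIX cksum utility. The names manifest and diff_by_checksum are hypothetical. In the remote case each machine would run manifest on its own side and only the resulting list would travel over the network.

```shell
# manifest DIR (hypothetical helper)
# Prints one "checksum size ./path" line per file under DIR, sorted by path.
manifest() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 )
}

# diff_by_checksum DIR_A DIR_B (hypothetical helper)
# Prints the paths whose checksum or size differs between the two trees,
# plus the paths that exist on one side only.
diff_by_checksum() {
    list_a=$(mktemp) || return 1
    list_b=$(mktemp) || return 1
    manifest "$1" > "$list_a"
    manifest "$2" > "$list_b"
    # A line that occurs in only one manifest marks a changed or missing file.
    sort "$list_a" "$list_b" | uniq -u | cut -d ' ' -f 3- | sed 's|^\./||' | sort -u
    rm -f "$list_a" "$list_b"
}
```

Identical files produce the same line in both lists and drop out. Anything left over is either different or present on one side only.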

How can I speed up Perl's readdir for a directory with 250,000 files?

I am using Perl's readdir to get a file listing; however, the directory contains more than 250,000 files, so readdir takes a long time (more than 4 minutes) and uses over 80 MB of RAM. As this was intended to be a recurring job every 5 minutes, this lag will not be acceptable.
More info:
Another job will fill the directory (once per day) being scanned.
This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run.
The Perl script is to run every 5 min and process (if applicable) up to 1000 files.
The file count limit is intended to let downstream processing keep up, as the Perl script pushes data into a database, which triggers a complex workflow.
Is there another way to obtain filenames from a directory, ideally limited to 1000 (set by a variable), which would greatly increase the speed of this script?
What exactly do you mean when you say readdir is taking minutes and 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?
Are you doing something like this:
foreach my $file ( readdir($dir) ) {
    # do stuff here
}
If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.
The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.
The fix for this is to use a while loop and use readdir in a scalar context.
while ( defined( my $file = readdir $dir ) ) {
    # do stuff.
}
Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
The solution may lie at the other end: in the script that fills the directory.
Why not create a directory tree to store all those files, so that you have lots of directories, each with a manageable number of files?
Instead of creating "mynicefile.txt", why not "m/my/mynicefile", or something like that?
Your file system would thank you for that (especially if you remove the empty directories when you have finished with them).
This is not exactly an answer to your query, but I think having that many files in the same directory is not a very good thing for overall speed (including the speed at which your filesystem handles add and delete operations, not just listing, as you have seen).
A solution to that design problem is to have sub-directories for each possible first letter of the file names, and have all files beginning with that letter inside that directory. Recurse to the second, third, etc. letter if need be.
You will probably see a definite speed improvement on many operations.
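To make the prefix layout concrete, here is a small shell sketch, written in shell rather than Perl purely for illustration. The names shard_path and store_sharded are hypothetical, and two levels (first letter, then first two letters) are an arbitrary choice.

```shell
# shard_path NAME (hypothetical helper)
# Prints NAME prefixed by its first letter and its first two letters.
shard_path() {
    name=$1
    first=$(printf '%s' "$name" | cut -c 1)
    second=$(printf '%s' "$name" | cut -c 1-2)
    printf '%s/%s/%s\n' "$first" "$second" "$name"
}

# store_sharded FILE ROOT (hypothetical helper)
# Moves FILE into its sharded location below ROOT, creating directories
# as needed.
store_sharded() {
    file=$1
    root=$2
    target=$root/$(shard_path "${file##*/}")
    mkdir -p "${target%/*}" && mv "$file" "$target"
}
```

With this layout, shard_path mynicefile.txt prints m/my/mynicefile.txt. If the file names are random, the files should spread fairly evenly across the subdirectories.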
You're saying that the content gets there by unpacking zip file(s). Why don't you just work on the zip files instead of creating/using 250k of files in one directory?
Basically, to speed it up you don't need anything specific in Perl, but rather something at the filesystem level. If you are 100% sure that you have to work with 250k files in one directory (and I can't imagine a situation where that would be required), you're much better off finding a filesystem that handles it well than looking for some "magical" Perl module that would scan it faster.
Probably not. I would guess most of the time is in reading the directory entry.
However, you could preprocess the entire directory listing, creating one file per 1000 entries. Then your process could work through one of those listing files each time and not incur the expense of reading the entire directory.
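A rough shell sketch of that preprocessing step, assuming it runs once after the daily fill job has finished. The name write_batches and the batch. file prefix are hypothetical, file names are assumed not to contain newlines, and -maxdepth is a find extension that GNU and BSD find both support.

```shell
# write_batches DIR OUT_DIR [BATCH_SIZE] (hypothetical helper)
# Reads DIR once and writes its file names into OUT_DIR/batch.aa,
# OUT_DIR/batch.ab, ... with at most BATCH_SIZE names per file.
write_batches() {
    dir=$1
    out=$2
    size=${3:-1000}
    mkdir -p "$out" || return 1
    ( cd "$dir" && find . -maxdepth 1 -type f -print ) |
        sed 's|^\./||' |
        split -l "$size" - "$out/batch."
}
```

Each five-minute run could then pick up one batch file, process the names listed in it, and delete the batch file when it is done.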
Have you tried just readdir() through the directory without any other processing at all to get a baseline?
You aren't going to be able to speed up readdir, but you can speed up the task of monitoring a directory. You can ask the OS for updates -- Linux has inotify, for example. Here's an article about using it:
http://www.ibm.com/developerworks/linux/library/l-ubuntu-inotify/index.html?ca=drs-
You can use Inotify from Perl:
http://metacpan.org/pod/Linux::Inotify2
The difference is that you will have one long-running app instead of a script that is started by cron. In the app, you'll keep a queue of files that are new (as provided by inotify). Then, you set a timer to go off every 5 minutes, and process 1000 items. After that, control returns to the event loop, and you either wake up in 5 minutes and process 1000 more items, or inotify sends you some more files to add to the queue.
(BTW, you will need an event loop to handle the timers; I recommend EV.)

Resources