Recursive copy to a flat directory - bash

I have a directory of images, currently at ~117k files for about 200 gig in size. My backup solution vomits on directories of that size, so I wish to split them into subdirectories of 1000. Name sorting or type discrimination is not required. I just want my backups to not go nuts.
From another answer, someone provided a way to move files into the split up configuration. However, that was a move, not a copy. Since this is a backup, I need a copy.
I have three thoughts:
1. Files are added to the large directory with random filenames, so alpha sorts aren't a practical way to figure out deltas. Even using a tool like rsync, adding a couple hundred files at the beginning of the list could cause a significant reshuffle and lots of file movement on the backup side.
2. The solution to this problem is to reverse the process: Do an initial file split, add new files to the backup at the newest directory, manually create a new subdir at the 1000 file mark, and then use rsync to pull files from the backup directories to the work area, e.g. rsync -trvh <backupdir>/<subdir>/ <masterdir>.
3. While some answers to similar questions indicate that rsync is a poor choice for this, I may need to do multiple passes, one of which would be via a slower link to an offsite location. The performance hit of rsync and its startup parsing is far preferable to the time it would take to re-upload the backup on a daily basis.
My question is:
How do I create a script that will recurse into all 117+ subdirectories and dump the contained files into my large working directory, without a lot of unnecessary copying?
My initial research produces something like this:
#!/bin/bash
cd /path/to/backup/tree/root
find . -type d -exec rsync -trvh * /path/to/work/dir/
Am I on the right track here?
It's safe to assume modern versions of bash, find, and rsync.
Thanks!
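For what it's worth, the find line as posted would not run: the shell expands * once, before find starts, and -exec needs a {} placeholder and a terminating \;. Below is a minimal sketch of one way to do the pull, assuming the backup tree is exactly one level deep and swapping find for a plain shell loop. The function name flatten_into and both paths are illustrative assumptions, not something from the question.

```shell
#!/bin/bash
# Sketch: pull every file from the split backup subdirectories back into
# one flat working directory. Name and paths are assumptions.

flatten_into() {
    local backup_root=$1 work_dir=$2 dir
    mkdir -p "$work_dir"
    for dir in "$backup_root"/*/; do
        [ -d "$dir" ] || continue   # glob matched nothing
        # "$dir" ends in a slash, so rsync copies the directory's contents
        # rather than the directory itself. -t preserves mtimes, which lets
        # later runs skip files that are already up to date.
        rsync -rt "$dir" "$work_dir"/
    done
}

# Example call (hypothetical paths):
# flatten_into /path/to/backup/tree/root /path/to/work/dir
```

Because rsync compares size and modification time by default, repeated runs should only transfer files that are new or changed in the backup subdirectories.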

Related

Fast list of top level directories in a tar

I have a tar file that is over 100GB in size, with thousands of files in it, and I'm trying to find a fast way to list only the top level.
I've seen people using --exclude and grep, but this still has to go through all of the file names first, which in my case takes a very long time.
Is there a way to do this?

How should I mark a folder as processed in a script?

A script shall process files in a folder on a Windows machine and mark it as done once it is finished in order to not pick it up in the next round of processing.
My tendency is to let the script rename the folder to a different name, like adding "_done".
But on Windows, renaming a folder is not possible if some process has the folder or a file within it open. In this setup, there is a minor chance that some user may have the folder open.
Alternatively I could just write a stamp-file into that folder.
Are there better alternatives?
Is there a way to force the renaming anyway, in particular when it is on a shared drive or some NAS drive?
You have several options:
1. Put a token file of some sort in each processed folder and skip the folders that contain said file
2. Keep track of the last folder processed and only process ones newer (either by time stamp or, since they're numbered sequentially, by sequence number)
3. Rename the folder
Since you've already stated that other users may already have the folder/files open, we can rule out #3.
In this situation, I'm in favor of option #1, even though you'll end up with extra files. If someone needs to figure out which folders have already been processed, they have a quick, easy way to see it with the naked eye, rather than hunting for a counter somewhere in a different file. It's also a bit less code to write, so there are fewer pieces to break.
Option #2 is good in this situation as well (I've used both depending on the circumstances), but I tend to favor it for things that a human wouldn't really need to care about or need to look for very often.
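A minimal sketch of option #1, assuming a POSIX-style shell is available on the Windows machine (Cygwin, for example). The marker name .processed and the process_folder placeholder are hypothetical; the real processing step would go there.

```shell
#!/bin/bash
# Sketch of the token-file approach: stamp each processed folder and skip
# folders that already carry the stamp. All names here are assumptions.

MARKER=".processed"

process_folder() {
    # Placeholder for the real work on one folder.
    echo "processing $1"
}

process_pending() {
    local root=$1 dir
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue          # no subfolders at all
        [ -e "$dir$MARKER" ] && continue   # done in an earlier run
        # Write the stamp only if processing succeeded.
        process_folder "$dir" && date > "$dir$MARKER"
    done
}

# Example call (hypothetical path):
# process_pending /path/to/incoming
```

Writing the stamp only creates a new file, so it does not depend on being able to rename the folder itself.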

How to copy updated files in batches of limited size?

I'm on a Mac, and I want to copy, incrementally, files that have changed, to a new location. An rsync kind of arrangement. But, because of what happens to them next, I want to limit how many GB get copied in each run of the job.
If, say, I run it each night from a cron job, the amount of data has some kind of constraint, though it doesn't need to be very precise.
Ideally, it would be something like "Copy any new files in directory tree A to directory tree B, but stop after the first file that takes you over 2GB copied in this run". (I do want all files to be copied eventually, so a single large one shouldn't kill it.)
I could obviously write some code to do this, but wondered if there were any apps, obscure rsync options, ingenious uses of tar or cpio, that would save me the trouble...
(For the curious, I'm copying to my FileTransporter, which then syncs to one in my brother's house, but he has a slower connection than me and I don't want any given change to take more than a couple of hours each night.)
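The behaviour described above (copy new files, stop after the first one that crosses the limit) could be sketched in plain shell roughly as follows. This is an illustration under assumptions, not a tested backup tool: copy_batch and its arguments are made-up names, file names containing newlines are not handled, and a file counts as "new" when the copy is missing or older than the source.

```shell
#!/bin/bash
# Sketch: copy new or updated files from tree A to tree B, stopping after
# the first file that pushes this run over a byte budget.

copy_batch() {
    local src=$1 dst=$2 limit_bytes=$3
    local copied=0 file rel target size
    find "$src" -type f | while IFS= read -r file; do
        rel=${file#"$src"/}
        target=$dst/$rel
        # Skip files whose copy exists and is not older than the source.
        if [ -e "$target" ] && [ ! "$file" -nt "$target" ]; then
            continue
        fi
        size=$(wc -c < "$file")
        mkdir -p "$(dirname "$target")"
        cp -p "$file" "$target"   # -p keeps the mtime for the next run's check
        copied=$((copied + size))
        # Checked after the copy, so a single oversized file still goes through.
        if [ "$copied" -gt "$limit_bytes" ]; then
            break
        fi
    done
}

# Example call (hypothetical paths, roughly 2 GB per run):
# copy_batch /path/to/A /path/to/B 2000000000
```

Each run picks up where the previous one stopped, because files copied earlier are skipped by the timestamp check.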

command line wisdom for 2 panel file manager user

Want to upgrade my file management productivity by replacing 2 panel file manager with command line (bash or cygwin). Can commandline give same speed? Please advise a guru way of how to do e.g. copy of some file in directory A to the directory B. Is it heavy use of pushd/popd? Or creation of links to most often used directories? What are the best practices and a day-to-day routine to manage files of a command line master?
Can commandline give same speed?
My experience is that commandline copying is significantly faster (especially in the Windows environment). Of course the basic laws of physics still apply, a file that is 1000 times bigger than a file that copies in 1 second will still take 1000 seconds to copy.
..(howto) copy of some file in directory A to the directory B.
Because I often have 5-10 projects that use similar directory structures, I set up variables for each subdir using a naming convention :
project=NewMatch
NM_scripts=${project}/scripts
NM_data=${project}/data
NM_logs=${project}/logs
NM_cfg=${project}/cfg
proj2=AlternateMatch
altM_scripts=${proj2}/scripts
altM_data=${proj2}/data
altM_logs=${proj2}/logs
altM_cfg=${proj2}/cfg
You can make this sort of thing as spartan or baroque as needed to match your theory of living/programming.
Then you can easily copy the cfg from 1 project to another
cp -p "$NM_cfg"/*.cfg "$altM_cfg"
Is it heavy use of pushd/popd?
Some people seem to really like that. You can try it and see what you think.
Or creation of links to most often used directories?
Links to dirs are, in my experience, used more in software development, where source code expects a certain set of dir names and your installation has different names. Then making links to supply the expected dir paths is helpful. For production data, it's just one more thing that can get messed up or blow up. That's not always true; maybe you'll have a really good reason to have links, but I wouldn't start out that way just because it is possible to do.
What are the best practices and a day-to-day routine to manage files of a command line master?
Per the above, use a standardized directory structure for all projects.
Have scripts save any small files to a directory your dept keeps under /tmp, i.e. /tmp/MyDeptsTmpFile (named to fit your local conventions).
It depends. If you're talking about data and logfiles, dated fileNames can save you a lot of time. I recommend dateFmts like YYYYMMDD(_HHMMSS) if you need the extra resolution.
Dated logfiles are very handy, when a current process seems like it is taking a long time, you can look at the log file from a week ago and quantify exactly how long this process took, a week, month, 6 months (up to how much space you can afford). LogFiles should also capture all STDERR messages, so you never have to re-run a bombed program just to see what the error message was.
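As a rough sketch of the dated-logfile idea, the wrapper below runs a command and captures both STDOUT and STDERR in a file stamped YYYYMMDD_HHMMSS. The name run_logged and the log directory layout are assumptions for illustration only.

```shell
#!/bin/bash
# Sketch: run a command with stdout and stderr captured in a dated logfile.

run_logged() {
    local log_dir=$1
    shift
    local stamp logfile
    stamp=$(date +%Y%m%d_%H%M%S)   # YYYYMMDD_HHMMSS sorts chronologically
    mkdir -p "$log_dir"
    logfile=$log_dir/$(basename "$1")_$stamp.log
    # 2>&1 sends STDERR to the same file as STDOUT, so error messages
    # survive without re-running the job.
    "$@" > "$logfile" 2>&1
}

# Example call (hypothetical script and directory):
# run_logged "$NM_logs" ./nightly_load.sh
```

The function returns the exit status of the wrapped command, so callers can still test for failure.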
This is Linux/Unix you're using, right? Read the man page for the cp cmd installed on your machine. I recommend using an alias like alias CP='/bin/cp -pi' so you always copy a file with the same permissions and with the original files' time stamp. Then it is easy to use /bin/ls -ltr to see a sorted list of files with the most recent files showing up at the bottom of the list. (No need to scroll back to the top, when you sort by time,reverse). Also the '-i' option will warn you that you are going to overwrite a file, and this has saved me more than a couple of times.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.

How to programmatically find the difference between two directories

First off; I am not necessarily looking for Delphi code, spit it out any way you want.
I've been searching around (especially here) and found a bit about people looking for ways to compare two directories (including subdirs), though they were using byte-by-byte methods. Second off, I am not looking for a difftool; I am "just" looking for a way to find files which do not match and, just as important, files which are in one directory but not the other and vice versa.
To be more specific: I have one directory (the backup folder) which I constantly update using FindFirstChangeNotification. The first time, though, I need to copy all files, and I also need to check the backup directory against the original when the application starts (in case something happened when the application wasn't running or FindFirstChangeNotification didn't catch a file change). To solve this I am thinking of creating a CRC list for the backed-up files, then running through the original directory computing the CRC for every file, and finally comparing the two CRC lists. Then somehow look for files which are in one directory and not the other (again, vice versa).
Here's the question: Is this the fastest way? If so, how would one (roughly) get the job done?
You don't necessarily need CRCs for each file, you can just compare the "last modified" date for every file for most normal purposes. It's WAY faster. If you need additional safety, you can also compare the lengths. You get both of these metrics for free with the find functions.
And in your change notification, you should probably add the files to a queue and use a timer object to copy the new queued files every ~30sec or something, so you don't bog down the system with frequent updates/checks.
For additional speed, use the Win32 functions wherever possible, avoid any Delphi find/copy/getfileinfo functions. I'm not familiar with the Delphi framework but for example the C# stuff is WAY WAY WAY slower than the Win32 functions.
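To illustrate the metadata-only comparison (last-modified time plus length, no CRCs), here is a shell sketch, assuming a Unix-like environment such as Cygwin rather than the Win32 API the answer recommends. compare_trees and size_of are hypothetical names, and file names containing newlines are not handled.

```shell
#!/bin/bash
# Sketch: compare two trees by file size and modification time only.

size_of() {
    # Byte count of a file; the arithmetic strips any padding wc may add.
    echo $(( $(wc -c < "$1") ))
}

compare_trees() {
    local a=$1 b=$2 rel
    # Files in A that are missing from B, or differ in size or mtime.
    (cd "$a" && find . -type f) | while IFS= read -r rel; do
        if [ ! -e "$b/$rel" ]; then
            echo "only in A: $rel"
        elif [ "$(size_of "$a/$rel")" -ne "$(size_of "$b/$rel")" ] ||
             [ "$a/$rel" -nt "$b/$rel" ] || [ "$a/$rel" -ot "$b/$rel" ]; then
            echo "differs: $rel"
        fi
    done
    # Files that exist only in B.
    (cd "$b" && find . -type f) | while IFS= read -r rel; do
        [ -e "$a/$rel" ] || echo "only in B: $rel"
    done
}

# Example call (hypothetical paths):
# compare_trees /path/to/original /path/to/backup
```

This only works if the copy step preserves modification times (cp -p, for instance); otherwise every file will be reported as different.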
Regardless of your "not looking for a difftool", are you opposed to using Cygwin with its "diff" command for the shell? If you are open to this, it's quite easy, particularly using diff with the -r "recursive" option.
The following generates the differences between 2 Rails installs on my machine, and greps out not only information about differences between files but also, specifically by grepping for 'Only', finds files in one directory, but not the other:
$ diff -r pgnindex pgnonrails | egrep '^Only|diff'
Only in pgnindex/app/controllers: openings_controller.rb
Only in pgnindex/app/helpers: openings_helper.rb
Only in pgnindex/app/views: openings
diff -r pgnindex/config/environment.rb pgnonrails/config/environment.rb
diff -r pgnindex/config/initializers/session_store.rb pgnonrails/config/initializers/session_store.rb
diff -r pgnindex/log/development.log pgnonrails/log/development.log
Only in pgnindex/test/functional: openings_controller_test.rb
Only in pgnindex/test/unit: helpers
The fastest way to compare one directory on the local machine to a directory on another machine thousands of miles away is exactly as you propose:
generate a CRC/checksum for every file
send the name, path, and CRC/checksum for each file over the internet to the other machine
compare
Perhaps the easiest way to do that is to use rsync with the "--dry-run" or "--list-only" option.
(Or use one of the many applications that use the rsync algorithm,
or compile the rsync algorithm into your application).
cd some_backup_directory
rsync -av --dry-run myname@remote_host:latest_version_directory .
For speed, the default rsync assumes, as Blindy suggested, that two files with the same name and the same path and the same length and the same modification time are the same.
For extra safety, you can give rsync the "--checksum" option to ignore the length and modification time and force it to compare (the checksum of) the actual contents of the file.
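If rsync is not an option, the three checksum steps listed above can be approximated by hand. The sketch below builds a sorted manifest of CRC, size, and path per file using the POSIX cksum utility; the function name manifest and the file names in the example are assumptions.

```shell
#!/bin/bash
# Sketch: build a checksum manifest for a directory tree so that two trees
# can be compared by diffing their manifests.

manifest() {
    # One line per file: "<crc> <bytes> <relative path>", sorted by path.
    (cd "$1" && find . -type f -exec cksum {} + | sort -k 3)
}

# Example (hypothetical names): generate one manifest on each machine,
# ship the small text file across, then compare locally.
# manifest /path/to/original > original.txt
# manifest /path/to/backup   > backup.txt
# diff original.txt backup.txt
```

Only the manifests travel over the network, so the cost is dominated by reading every file once on each side.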
