Zip directory in different batches - bash

I'm trying to zip a massive directory with images that will be fed into a deep learning system. This is incredibly time consuming, so I would like to stop prematurely the zipping proccess with Ctrl + C and zip the directory in different "batches".
Currently I'm using zip -r9v folder.zip folder, and I've seen that the option -u allows to update changed files and add new ones.
I'm worried about some file or the zip itself ending up corrupted if I terminate the process with Ctrl + C. From this answer I understand that the cp can be terminated safely, and this other answer suggests that gzip is also safe.
Putting it all together: Is it safe to end prematurely the zip command? Is the -u option viable for zipping in different batches?

Is it safe to end prematurely the zip command?
In my tests, canceling zip (Info-ZIP, 16 June 2008 (v3.0)) using CtrlC did not create a zip-archive at all, even when the already compressed data was 2.5GB. Therefore, I would say CtrlC is "safe" (you won't end up with a corrupted file, but also pointless (you did all the work for nothing).
Is the -u option viable for zipping in different batches?
Yes. Zip archives compress each file individually, so the archives you get from adding files later on are as good as adding all files in a single run. Just remember that starting zip takes time too. So set the batch size as high as acceptable to save time.
Here is a script that adds all your files to the zip archive, but gives a chance to stop the compression at every 100th file.
#! /bin/bash
batchsize=100
shopt -s globstar
files=(folder/**)
echo "Press enter to stop compression after this batch."
for ((startfile=0; startfile<"${#files[#]}"; startfile+=batchsize)); do
((startfile==0)) && u= || u=u
zip "-r9v$u" folder.zip "${files[#]:startfile:batchsize}"
u=u
if read -t 0; then
echo "Compression stopped before file $startfile."
echo "Re-run this script with startfile=$startfile to continue".
exit
fi
done
For more speed you might want to look into alternative zip implementations.

Related

Directory monitoring using fswatch

I am using fswatch to monitor a directory and run a script when video files are copied into that directory:
fswatch -o /Path/To/Directory/Directory | xargs -n 1 sh /Path/To/Script/Script.sh
The problem is that the file is often not completed its copy before the script is actioned. The files are video files of varying size. Small files are OK, larger files are not.
How can I delay the fswatch notification until the file has completed its copy?
First of all, the behaviour of the fswatch "monitors" is OS-specific: when asking question about fswatch you'd better specify the OS you use.
However, there's no way to do that using fswatch alone. A process may open a file for writing and keep it open for an amount of time sufficiently long for the OS to send multiple events. I'm afraid there is nothing fswatch can do about it.
An alternate approach may be using another tool to check whether the modified file is currently open: if it is not, then run your script, otherwise skip it and wait for its next event. Such tools are OS-specific: in OS X and Linux you may use lsof. Beware this approach does not protect you from another process opening that file while your script is running.

How to create a batch file in Mac?

I need to find a solution at work to backup specific folders daily, hopefully to a RAR or ZIP file.
If it was on PC, I would have done it already. But I don't have any idea to how to approach it on a Mac.
What I basically want to achieve is an automated task, that can be run with an executable, that does:
compress a specific directory (/Volumes/Audio/Shoko) to a rar or zip file.
(in the zip file exclude all *.wav files in all sub Directories and a directory names "Videos").
move It to a network share (/Volumes/Post Shared/Backup From Sound).
(or compress directly to this folder).
automate the file name of the Zip file with dynamic date and time (so no duplicate file names).
Shutdown Mac when finished.
I want to say again, I don't usually use Mac, so things like what kind of file to open for the script, and stuff like that is not trivial for me, yet.
I have tried to put Mark's bash lines (from the first answer, below) in a txt file and executed it, but it had errors and didn't work.
I also tried to use Automator, but it's too plain, no advanced options.
How can I accomplish this?
I would love a working example :)
Thank You,
Dave
You can just make a bash script that does the backup and then you can either double-click it or run it on a schedule. I don't know your paths and/or tools of choice, but some thing along these lines:
#!/bin/bash
FILENAME=`date +"/Volumes/path/to/network/share/Backup/%Y-%m-%d.tgz"`
cd /directory/to/backup || exit 1
tar -cvz "$FILENAME" .
You can save that on your Desktop as backup and then go in Terminal and type:
chmod +x ~/Desktop/backup
to make it executable. Then you can just double click on it - obviously after changing the paths to reflect what you want to backup and where to.
Also, you may prefer to use some other tools - such as rsync but the method is the same.

How to tar a folder while files inside the folder might being written by some other process

I am trying to create a script for cron job. I have around 8 GB folder containing thousands of files. I am trying to create a bash script which first tar the folder and then transfer the tarred file to ftp server.
But I am not sure while tar is tarring the folder and some other process is accessing files inside it or writing to the files inside it.
Although its is fine for me if the tarred file does not contains that recent changes while the tar was tarring the folder.
suggest me the proper way. Thanks.
tar will hapilly tar "whatever it can". But you will probably have some surprises when untarring, as tar also stored the size of the file it tars, before taring it. So expect some surprises.
A very unpleasant surprise would be : if the size is truncated, then tar will "fill" it with "NUL" characters to match it's recorded size... This can give very unpleasant side effects. In some cases, tar, when untarring, will say nothing, and silently add as many NUL characters it needs to match the size (in fact, in unix, it doesn't even need to do that : the OS does it, see "sparse files"). In some cases, if truncating occured during the taring of the file, tar will complain it encounters an Unexpected End of File when untarring (as it expected XXX bytes but only reads fewer than this), but will still say that the file should be XXX bytes (and the unix OSes will then create it as a sparse file, with "NUL" chars magically appended at the end to match the expected size when you read it).
(to see the NUL chars : an easy way is to less thefile (or cat -v thefile | more on a very old unix. Look for any ^#)
But on the contrary, if files are only appended to (logs, etc), then the side effect is less problematic : you will only miss some bits of them (which you say you're ok about), and not have that unpleasant "fill with NUL characters" side effects. tar may complain when untarring the file, but it will untar it.
I think tar failed (so do not create archive) when an archived file is modified during archiving. As Etan said, the solution depends on what you want finally in the tarball.
To avoid a tar failure, you can simply COPY the folder elsewhere before to call tar. But in this case, you cannot be confident in the consistency of the backuped directory. It's NOT an atomic operation, so some files will be todate while other files will be outdated. It can be a severe issue or not follow your situation.
If you can, I suggest you configure how these files are created. For example: "only recent files are appended, files older than 1 day are never changed", in this case you can easily backup only old files and the backup will be consistent.
More generally, you have to accept to loose last data AND be not consistant (each files is backup at a different date), or you have to act at a different level. I suggest :
Configure the software that produces the data to choose a consistency
Or use OS/Virtualization features. For example it's possible to do consistent snapshot of a storage on some virtual storage...

Is there a way to move files from one set of directories to another set of corresponding directories

I take delivery of files from multiple places as part of a publishing aggregation service. I need a way to move files that have been delivered to me from one location to another without losing the directory listings for sorting purposes.
Example:
Filepath of delivery: Server/Vendor/To_Company/Customer_Name/**
Filepath of processing: ~/Desktop/MM-DD-YYYY/Returned_Files/Customer_Name/**
I know I can move all of the directories by doing something such as:
find Server/Vendor/To_Company/* -exec mv -n ~/Desktop/MM-DD-YYYY/Returned_Files \;
but using that I can only run the script one time per day and there are times when I might need to run it multiple times.
It seems like ideally I should be able to create a copycat directory in my daily processing folder and then move the files from one to the other.
you can use rsync command with --remove-source-files option. you can run it as many times as needed.
#for trial run, without making any actual transfer.
rsync --dry-run -rv --remove-source-files Server/Vendor/To_Company/ ~/Desktop/MM-DD-YYYY/Returned_Files/
#command
rsync -rv --remove-source-files Server/Vendor/To_Company/ ~/Desktop/MM-DD-YYYY/Returned_Files/
reference:
http://www.cyberciti.biz/faq/linux-unix-bsd-appleosx-rsync-delete-file-after-transfer/
You could use rsync to do this for you:
rsync -a --remove-source-files /Server/Vendor/To_Company/Customer_Name ~/Desktop/$(date +"%y-%m-%d")/Returned_files/
Add -n to do a dry run to make sure it does what you want.
From the manual page:
--remove-source-files
This tells rsync to remove from the sending side the files (meaning non-directories) that are a part of the
transfer and have been successfully duplicated on the receiving side.
Note that you should only use this option on source files that are quiescent. If you are using this to move
files that show up in a particular directory over to another host, make sure that the finished files get renamed
into the source directory, not directly written into it, so that rsync can’t possibly transfer a file that is
not yet fully written. If you can’t first write the files into a different directory, you should use a naming
idiom that lets rsync avoid transferring files that are not yet finished (e.g. name the file "foo.new" when it
is written, rename it to "foo" when it is done, and then use the option --exclude='*.new' for the rsync trans‐
fer).

How can I find a directory-diff of millions of files to script maintenance?

I have been working on how to verify that millions of files that were on file system A have infact been moved to file system B. While working on a system migration, it became evident that all the files needed to be audited to prove that the files have been moved. The files were initially moved via rsync, which does provide logs, although not in a format that is helpful for doing an audit. So, I wrote this script to index all the files on System A:
#!/bin/bash
# Get directories and file list to be used to verify proper file moves have worked successfully.
LOGDATE=`/usr/bin/date +%Y-%m-%d`
FILE_LIST_OUT=/mounts/A_files_$LOGDATE.txt
MOUNT_POINTS="/mounts/AA mounts/AB"
touch $FILE_LIST_OUT
echo TYPE,USER,GROUP,BYTES,OCTAL,OCTETS,FILE_NAME > $FILE_LIST_OUT
for directory in $MOUNT_POINTS; do
# format: type,user,group,bytes,octal,octets,file_name
gfind $directory -mount -printf "%y","%u","%g","%s","%m","%p\n" >> $FILE_LIST_OUT
done
The file indexing works fine and takes about two hours to index ~30 million files.
On side B is where we run into issues. I have written a very simple shell script that reads the index file, tests to see if the file is there, and then counts up how many files are there, but it's running out of memory while looping through the 30 million lines on indexed file names. Effectively doing this little bit of code below through a while loop, and counters to increment for files found and not found.
if [ -f "$TYPE" "$FILENAME" ] ; then
print file found
++
else
file not found
++
fi
My questions are:
Can a shell script do this type of reporting from such a large list. A 64 bit unix system ran out of memory while trying to execute this script. I have already considered breaking up the input script into smaller chunks to make it faster. Currently it can
If as shell script is inappropriate, what would you suggest?
You just used rsync, use it again...
--ignore-existing
This tells rsync to skip updating files that already exist on the destination (this does not ignore existing directories, or nothing would get done). See also --existing.
This option is a transfer rule, not an exclude, so it doesn’t affect the data that goes into the file-lists, and thus it doesn’t affect deletions. It just limits the files that the receiver requests to be transferred.
This option can be useful for those doing backups using the --link-dest option when they need to continue a backup run that got interrupted. Since a --link-dest run is copied into a new directory hierarchy (when it is used properly), using --ignore existing will ensure that the already-handled files don’t get tweaked (which avoids a change in permissions on the hard-linked files). This does mean that this option is only looking at the existing files in the destination hierarchy itself.
That will actually fix any problems (at least in the same sense that any diff-list on file-exist tests could fix problem. Using --ignore-existing means rsync only does the file-exist tests (so it'll construct the diff list as you request and use it internally). If you just want information on the differences, check --dry-run, and --itemize-changes.
Lets say you have two directories, foo and bar. Let's say bar has three files, 1,2, and 3. Let's say that bar, has a directory quz, which has a file 1. The directory foo is empty:
Now, here is the result,
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/
>f+++++++++ 1
>f+++++++++ 2
>f+++++++++ 3
cd+++++++++ quz/
>f+++++++++ quz/1
Note, you're not interested in the cd+++++++++ -- that's just showing you that rsync issued a chdir. Now, let's add a file in foo called 1, and let's use grep to remove the chdir(s),
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/ | grep -v '^cd'
>f+++++++++ 2
>f+++++++++ 3
>f+++++++++ quz/1
f is for file. The +++++++++ means the file doesn't exist in the DEST dir.
Here is the bonus, remove --dry-run, and, it'll go ahead and make the changes for you.
Have you considered a solution such as kdiff3, which will diff directories of files ?
Note the feature for version 0.9.84
Directory-Comparison: Option "Full Analysis" allows to show the number
of solved vs. unsolved conflicts or deltas vs. whitespace-changes in
the directory tree.
There is absolutely no problem reading a 30 million line file in a shell script. The reason why your process failed was most likely that you tried to read the file entirely into memory, e.g. by doing something wrong like for i in $(cat file).
The correct way of reading a file is:
while IFS= read -r line
do
echo "Something with $line"
done < someFile
A shell script is inappropriate, yes. You should be using a diff tool:
diff -rNq /original /new
If you're not particular about the solution being a script, you could also look into meld, which would let you diff directory trees quite easily and you can also set ignore patterns if you have any.

Resources