How can I find a directory-diff of millions of files to script maintenance?

How can I find a directory-diff of millions of files to script maintenance? - shell

I have been working on how to verify that millions of files that were on file system A have infact been moved to file system B. While working on a system migration, it became evident that all the files needed to be audited to prove that the files have been moved. The files were initially moved via rsync, which does provide logs, although not in a format that is helpful for doing an audit. So, I wrote this script to index all the files on System A:
#!/bin/bash
# Get directories and file list to be used to verify proper file moves have worked successfully.
LOGDATE=`/usr/bin/date +%Y-%m-%d`
FILE_LIST_OUT=/mounts/A_files_$LOGDATE.txt
MOUNT_POINTS="/mounts/AA mounts/AB"
touch $FILE_LIST_OUT
echo TYPE,USER,GROUP,BYTES,OCTAL,OCTETS,FILE_NAME > $FILE_LIST_OUT
for directory in $MOUNT_POINTS; do
# format: type,user,group,bytes,octal,octets,file_name
gfind $directory -mount -printf "%y","%u","%g","%s","%m","%p\n" >> $FILE_LIST_OUT
done
The file indexing works fine and takes about two hours to index ~30 million files.
On side B is where we run into issues. I have written a very simple shell script that reads the index file, tests to see if the file is there, and then counts up how many files are there, but it's running out of memory while looping through the 30 million lines on indexed file names. Effectively doing this little bit of code below through a while loop, and counters to increment for files found and not found.
if [ -f "$TYPE" "$FILENAME" ] ; then
print file found
++
else
file not found
++
fi
My questions are:
Can a shell script do this type of reporting from such a large list. A 64 bit unix system ran out of memory while trying to execute this script. I have already considered breaking up the input script into smaller chunks to make it faster. Currently it can
If as shell script is inappropriate, what would you suggest?

You just used rsync, use it again...
--ignore-existing
This tells rsync to skip updating files that already exist on the destination (this does not ignore existing directories, or nothing would get done). See also --existing.
This option is a transfer rule, not an exclude, so it doesn’t affect the data that goes into the file-lists, and thus it doesn’t affect deletions. It just limits the files that the receiver requests to be transferred.
This option can be useful for those doing backups using the --link-dest option when they need to continue a backup run that got interrupted. Since a --link-dest run is copied into a new directory hierarchy (when it is used properly), using --ignore existing will ensure that the already-handled files don’t get tweaked (which avoids a change in permissions on the hard-linked files). This does mean that this option is only looking at the existing files in the destination hierarchy itself.
That will actually fix any problems (at least in the same sense that any diff-list on file-exist tests could fix problem. Using --ignore-existing means rsync only does the file-exist tests (so it'll construct the diff list as you request and use it internally). If you just want information on the differences, check --dry-run, and --itemize-changes.
Lets say you have two directories, foo and bar. Let's say bar has three files, 1,2, and 3. Let's say that bar, has a directory quz, which has a file 1. The directory foo is empty:
Now, here is the result,
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/
>f+++++++++ 1
>f+++++++++ 2
>f+++++++++ 3
cd+++++++++ quz/
>f+++++++++ quz/1
Note, you're not interested in the cd+++++++++ -- that's just showing you that rsync issued a chdir. Now, let's add a file in foo called 1, and let's use grep to remove the chdir(s),
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/ | grep -v '^cd'
>f+++++++++ 2
>f+++++++++ 3
>f+++++++++ quz/1
f is for file. The +++++++++ means the file doesn't exist in the DEST dir.
Here is the bonus, remove --dry-run, and, it'll go ahead and make the changes for you.

Have you considered a solution such as kdiff3, which will diff directories of files ?
Note the feature for version 0.9.84
Directory-Comparison: Option "Full Analysis" allows to show the number
of solved vs. unsolved conflicts or deltas vs. whitespace-changes in
the directory tree.

There is absolutely no problem reading a 30 million line file in a shell script. The reason why your process failed was most likely that you tried to read the file entirely into memory, e.g. by doing something wrong like for i in $(cat file).
The correct way of reading a file is:
while IFS= read -r line
do
echo "Something with $line"
done < someFile
A shell script is inappropriate, yes. You should be using a diff tool:
diff -rNq /original /new

If you're not particular about the solution being a script, you could also look into meld, which would let you diff directory trees quite easily and you can also set ignore patterns if you have any.

Related

Zip directory in different batches

I'm trying to zip a massive directory with images that will be fed into a deep learning system. This is incredibly time consuming, so I would like to stop prematurely the zipping proccess with Ctrl + C and zip the directory in different "batches".
Currently I'm using zip -r9v folder.zip folder, and I've seen that the option -u allows to update changed files and add new ones.
I'm worried about some file or the zip itself ending up corrupted if I terminate the process with Ctrl + C. From this answer I understand that the cp can be terminated safely, and this other answer suggests that gzip is also safe.
Putting it all together: Is it safe to end prematurely the zip command? Is the -u option viable for zipping in different batches?

Is it safe to end prematurely the zip command?
In my tests, canceling zip (Info-ZIP, 16 June 2008 (v3.0)) using CtrlC did not create a zip-archive at all, even when the already compressed data was 2.5GB. Therefore, I would say CtrlC is "safe" (you won't end up with a corrupted file, but also pointless (you did all the work for nothing).
Is the -u option viable for zipping in different batches?
Yes. Zip archives compress each file individually, so the archives you get from adding files later on are as good as adding all files in a single run. Just remember that starting zip takes time too. So set the batch size as high as acceptable to save time.
Here is a script that adds all your files to the zip archive, but gives a chance to stop the compression at every 100th file.
#! /bin/bash
batchsize=100
shopt -s globstar
files=(folder/**)
echo "Press enter to stop compression after this batch."
for ((startfile=0; startfile<"${#files[#]}"; startfile+=batchsize)); do
((startfile==0)) && u= || u=u
zip "-r9v$u" folder.zip "${files[#]:startfile:batchsize}"
u=u
if read -t 0; then
echo "Compression stopped before file $startfile."
echo "Re-run this script with startfile=$startfile to continue".
exit
fi
done
For more speed you might want to look into alternative zip implementations.

How to create tar files automatically

I like to create tar-files to distribute some scripts using bash.
For every script certain configuration-files and libraries (or toolboxes) are needed,
e.g. a script called CheckTool.py needs Checks.ini, CheckToolbox.py and CommontToolbox.py to run, which are stored in specific folders on my harddisk and need to be copied in the same manner on the users harddisk.
I can create a tarfile manually for each script, but i like to have it more simple.
For this i have the idea to define a list of all needed files and their pathes for a specific script and read this in a bashscript, which creates the tar file.
I started with:
#!/bin/bash
while read line
do
echo "$line"
done < $1
Which is reading the files and pathes. In my example the lines are:
./CheckTools/CheckMesh.bs
./Configs/CheckMesh.ini
./Toolboxes/CommonToolbox.bs
./Toolboxes/CheckToolbox.bs
My question is how do I have to organize the data to make a tar file with the specified files using bash?
Or is there someone having a better idea?

No need for a complicated script, use option -T of tar. Every file listed in there will be added to the tar file:
-T, --files-from FILE
get names to extract or create from FILE
So your script becomes:
#!/bin/bash
tar -cvpf something.tar -T listoffiles.txt
listoffiles.txt format is super easy, one file per line. You might want to put full path to ensure you get the right files:
./CheckTools/CheckMesh.bs
./Configs/CheckMesh.ini
./Toolboxes/CommonToolbox.bs
./Toolboxes/CheckToolbox.bs
You can add tar commands to the script as needed, or you could loop on the list files, from that point on, your imagination is the limit!

Delete Files That Are Too Long Using Bash

This is not a duplicate as I want to achieve the result without using Windows tools as mentioned in this post https://superuser.com/questions/78434/tool-for-deleting-directories-with-path-names-too-long-for-normal-delete
I am deleting SVN ignored files that Maven has created (something like this)
jaxb/com.mycompany.test.jaxb.upload-document/target
jaxb/com.mycompany.test.jaxb.fpml/target
jaxb/com.mycompany.test.jaxb.fixml/target
jaxb/com.mycompany.test.jaxb.trade-header/target
jaxb/com.mycompany.test.jaxb.schema-versioning/target
test-risk/target
I am using the following command but it fails when the path is too long on Windows.
echo "Deleting ignored files"
svn stat --no-ignore | awk '$1=="I" { print $0 }'|awk '{$1=""; print $0}'|tr '\\' '/'|while read file;do rm -rf "$file" ;done;
This is run as part of a script that runs on Unix as well as Windows which is why I can't use any of the previously mentioned tools.
What I need to do is write a routine that checks to see if there are any errors due to long file length and recursively rename the directories from the top down to the bottom with single characters in an attempt to shorten the path.
I'm not sure how to check for the errors ("Directory not empty" and " File or path name too long").
Ideally I'd like to fail the "rm -rf" immediately rather than throwing the error hundreds of times.
Finally, I'm not sure what the most efficient way to recursively rename the directories is.
EDIT: If I just assume that all directories are going to cause the issue I could get the roots cd to them in turn and recursively rename the sub-dirs. This way I don't have to worry about catching errors from rm.

How to tar a folder while files inside the folder might being written by some other process

I am trying to create a script for cron job. I have around 8 GB folder containing thousands of files. I am trying to create a bash script which first tar the folder and then transfer the tarred file to ftp server.
But I am not sure while tar is tarring the folder and some other process is accessing files inside it or writing to the files inside it.
Although its is fine for me if the tarred file does not contains that recent changes while the tar was tarring the folder.
suggest me the proper way. Thanks.

tar will hapilly tar "whatever it can". But you will probably have some surprises when untarring, as tar also stored the size of the file it tars, before taring it. So expect some surprises.
A very unpleasant surprise would be : if the size is truncated, then tar will "fill" it with "NUL" characters to match it's recorded size... This can give very unpleasant side effects. In some cases, tar, when untarring, will say nothing, and silently add as many NUL characters it needs to match the size (in fact, in unix, it doesn't even need to do that : the OS does it, see "sparse files"). In some cases, if truncating occured during the taring of the file, tar will complain it encounters an Unexpected End of File when untarring (as it expected XXX bytes but only reads fewer than this), but will still say that the file should be XXX bytes (and the unix OSes will then create it as a sparse file, with "NUL" chars magically appended at the end to match the expected size when you read it).
(to see the NUL chars : an easy way is to less thefile (or cat -v thefile | more on a very old unix. Look for any ^#)
But on the contrary, if files are only appended to (logs, etc), then the side effect is less problematic : you will only miss some bits of them (which you say you're ok about), and not have that unpleasant "fill with NUL characters" side effects. tar may complain when untarring the file, but it will untar it.

I think tar failed (so do not create archive) when an archived file is modified during archiving. As Etan said, the solution depends on what you want finally in the tarball.
To avoid a tar failure, you can simply COPY the folder elsewhere before to call tar. But in this case, you cannot be confident in the consistency of the backuped directory. It's NOT an atomic operation, so some files will be todate while other files will be outdated. It can be a severe issue or not follow your situation.
If you can, I suggest you configure how these files are created. For example: "only recent files are appended, files older than 1 day are never changed", in this case you can easily backup only old files and the backup will be consistent.
More generally, you have to accept to loose last data AND be not consistant (each files is backup at a different date), or you have to act at a different level. I suggest :
Configure the software that produces the data to choose a consistency
Or use OS/Virtualization features. For example it's possible to do consistent snapshot of a storage on some virtual storage...

Is there a way to move files from one set of directories to another set of corresponding directories

I take delivery of files from multiple places as part of a publishing aggregation service. I need a way to move files that have been delivered to me from one location to another without losing the directory listings for sorting purposes.
Example:
Filepath of delivery: Server/Vendor/To_Company/Customer_Name/**
Filepath of processing: ~/Desktop/MM-DD-YYYY/Returned_Files/Customer_Name/**
I know I can move all of the directories by doing something such as:
find Server/Vendor/To_Company/* -exec mv -n ~/Desktop/MM-DD-YYYY/Returned_Files \;
but using that I can only run the script one time per day and there are times when I might need to run it multiple times.
It seems like ideally I should be able to create a copycat directory in my daily processing folder and then move the files from one to the other.

you can use rsync command with --remove-source-files option. you can run it as many times as needed.
#for trial run, without making any actual transfer.
rsync --dry-run -rv --remove-source-files Server/Vendor/To_Company/ ~/Desktop/MM-DD-YYYY/Returned_Files/
#command
rsync -rv --remove-source-files Server/Vendor/To_Company/ ~/Desktop/MM-DD-YYYY/Returned_Files/
reference:
http://www.cyberciti.biz/faq/linux-unix-bsd-appleosx-rsync-delete-file-after-transfer/

You could use rsync to do this for you:
rsync -a --remove-source-files /Server/Vendor/To_Company/Customer_Name ~/Desktop/$(date +"%y-%m-%d")/Returned_files/
Add -n to do a dry run to make sure it does what you want.
From the manual page:
--remove-source-files
This tells rsync to remove from the sending side the files (meaning non-directories) that are a part of the
transfer and have been successfully duplicated on the receiving side.
Note that you should only use this option on source files that are quiescent. If you are using this to move
files that show up in a particular directory over to another host, make sure that the finished files get renamed
into the source directory, not directly written into it, so that rsync can’t possibly transfer a file that is
not yet fully written. If you can’t first write the files into a different directory, you should use a naming
idiom that lets rsync avoid transferring files that are not yet finished (e.g. name the file "foo.new" when it
is written, rename it to "foo" when it is done, and then use the option --exclude='*.new' for the rsync trans‐
fer).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio