diff a directory recursively, ignoring all binary files - bash

Working on a Fedora Constantine box. I am looking to diff two directories recursively to check for source changes. Due to the setup of the project (prior to my own engagement with said project! sigh), the directories contain both source and binaries, as well as large binary datasets. While diffing eventually works on these directories, it would take perhaps twenty seconds if I could ignore the binary files.
As far as I understand, diff does not have an 'ignore binary file' mode, but does have an ignore argument which will ignore regular expression within a file. I don't know what to write there to ignore binary files, regardless of extension.
I'm using the following command, but it does not ignore binary files. Does anyone know how to modify this command to do this?
diff -rq dir1 dir2

Kind of cheating but here's what I used:
diff -r dir1/ dir2/ | sed '/Binary\ files\ /d' >outputfile
This recursively compares dir1 to dir2; sed deletes the lines that report binary files (those containing "Binary files "), and the result is redirected to outputfile.

Maybe use grep -I (which is equivalent to grep --binary-files=without-match) as a filter to sort out binary files.
dir1='folder-1'
dir2='folder-2'
IFS=$'\n'
for file in $(grep -Ilsr -m 1 '.' "$dir1"); do
diff -q "$file" "${file/${dir1}/${dir2}}"
done
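A variant of the loop above that tolerates spaces in file names and reports files missing from the second tree could look like the following. It is only a sketch: the function name diff_text_files is made up for illustration, and it assumes a grep that supports -I (GNU and BSD grep do) and file names without embedded newlines.

```shell
# Hypothetical helper: compare only the non-binary files of two trees.
# Assumes grep supports -I and that no file name contains a newline.
diff_text_files() {
    dir1=${1%/}
    dir2=${2%/}
    # "grep -Iq ." succeeds only for files that grep treats as text and that
    # contain at least one non-empty line, so find prints just those paths.
    find "$dir1" -type f -exec grep -Iq . {} \; -print |
    while IFS= read -r file; do
        rel=${file#"$dir1"/}
        if [ -f "$dir2/$rel" ]; then
            # diff exits 1 when the files differ; that is not an error here.
            diff -q "$file" "$dir2/$rel" || true
        else
            echo "Only in $dir1: $rel"
        fi
    done
}
```

It would be called as diff_text_files folder-1 folder-2. Files that exist only in the second tree are not reported by this sketch.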

I came to this (old) question looking for something similar (Config files on a legacy production server compared to default apache installation). Following #fearlesstost's suggestion in the comments, git is sufficiently lightweight and fast that it's probably more straightforward than any of the above suggestions. Copy version1 to a new directory. Then do:
git init
git add .
git commit -m 'Version 1'
Now delete all the files from version 1 in this directory and copy version 2 into the directory. Now do:
git add .
git commit -m 'Version 2'
git show
This will show you Git's version of all the differences between the first commit and the second. For binary files it will just say that they differ. Alternatively, you could create a branch for each version and try to merge them using git's merge tools.

If the names of the binary files in your project follow a specific pattern (*.o, *.so, ...) as they usually do, you can put those patterns in a file and specify it using -X (hyphen X).
Contents of my exclude_file
*.o
*.so
*.git
Command:
diff -X exclude_file -r . other_tree > my_diff_file
UPDATE:
-x can be used instead of -X, to specify exclusion patterns on the command line rather than in a file:
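diff -r -x '*.o' -x '*.so' -x '*.git' dir1 dir2
The patterns are quoted so that the shell passes them to diff unchanged instead of expanding them against the current directory.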

Use a combination of find and the file command. This requires some research into what the file command reports for the files in your directory; below I'm assuming that the files you want to diff are reported as ASCII. Alternatively, use grep -v to filter out the binary files.
#!/bin/bash
dir1=/path/to/first/folder
dir2=/path/to/second/folder
cd "$dir1" || exit 1
# Keep only the files that the file command reports as ASCII text
files=$(find . -type f -print | xargs file | grep ASCII | cut -d: -f1)
for i in $files;
do
echo "diffing $i ---- $dir2/$i"
diff -q "$i" "$dir2/$i"
done
Since you probably know the names of the huge binaries, place them in an associative array and only run diff when a file is not in it, something like this:
#!/bin/bash
dir1=/path/to/first/directory
dir2=/path/to/second/directory
content_dir1=$(mktemp)
content_dir2=$(mktemp)
# List the files of each tree; sort so that the two lists line up for diff
(cd "$dir1" && find . -type f -print | sort > "$content_dir1")
(cd "$dir2" && find . -type f -print | sort > "$content_dir2")
echo Files that only exist in one of the paths
echo -----------------------------------------
diff $content_dir1 $content_dir2
#Files 2 Ignore
declare -A F2I
F2I=( [sqlite3]=1 [binfile2]=1 )
while IFS= read -r f;
do
b=$(basename "$f")
if ! [[ ${F2I[$b]} ]]; then
diff "$dir1/$f" "$dir2/$f"
fi
done < "$content_dir1"

Well, as a crude sort of check, you could ignore files that match /\0/.
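That crude check can be sketched without any non-standard grep options. The helper below, has_nul (a made-up name), looks only at the first 8000 bytes, an arbitrary cut-off chosen here to avoid reading large datasets in full; it assumes head accepts -c.

```shell
# Hypothetical helper: succeed if the first 8000 bytes of a file contain a NUL.
# The 8000-byte limit is an arbitrary choice for this sketch.
has_nul() {
    total=$(head -c 8000 "$1" | wc -c)
    kept=$(head -c 8000 "$1" | tr -d '\000' | wc -c)
    [ "$total" -ne "$kept" ]
}

# Hypothetical wrapper: diff two files only when neither looks binary.
diff_unless_binary() {
    if has_nul "$1" || has_nul "$2"; then
        return 0
    fi
    # diff exits 1 when the files differ; that is not an error here.
    diff -q "$1" "$2" || true
}
```

A text file in an encoding that uses NUL bytes (UTF-16, for example) would be skipped by this check as well.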

Related

Rename files in bash based on content inside

I have a directory which has 70000 xml files in it. Each file has a tag which looks something like this, for the sake of simplicity:
<ns2:apple>, <ns2:orange>, <ns2:grapes>, <ns2:melon>. Each file has only one fruit tag, i.e. there cannot be both apple and orange in the same file.
I would like rename every file (add "1_" before the beginning of each filename) which has one of: <ns2:apple>, <ns2:orange>, <ns2:melon> inside of it.
I can find such files with egrep:
egrep -r '<ns2:apple>|<ns2:orange>|<ns2:melon>'
So how would it look as a bash script, which I can then use as a cron job?
P.S. Sorry I don't have any bash script draft, I have very little experience with it and the time is of the essence right now.
This may be done with this script:
#!/bin/sh
find /path/to/directory/with/xml -type f | while read f; do
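# $f is a full path, so the prefix has to go in front of the file name only
grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f" && mv "$f" "$(dirname "$f")/1_$(basename "$f")"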
done
But it will rescan the directory each time it runs and prepend 1_ to each file containing one of your tags. This means a lot of excess IO, and files with certain tags will get another 1_ prefix on each run, resulting in names like 1_1_1_1_file.xml.
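One way around the repeated prefix is to skip files that already start with 1_. A minimal sketch, assuming file names contain no newlines; the function name prefix_tagged is invented for this example:

```shell
# Hypothetical helper: prefix matching XML files with "1_" exactly once.
prefix_tagged() {
    dir=${1%/}
    # "! -name '1_*'" skips files that were renamed on an earlier run.
    find "$dir" -type f -name '*.xml' ! -name '1_*' |
    while IFS= read -r f; do
        if grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f"; then
            mv "$f" "$(dirname "$f")/1_$(basename "$f")"
        fi
    done
}
```

Running prefix_tagged /path/to/directory/with/xml twice leaves 1_file.xml as it is. The directory is still rescanned on every run, so the IO cost remains.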
Probably you should think more on design, e.g. move processed files to two directories based on whether file has certain tags or not:
#!/bin/sh
# create output dirs
mkdir -p /path/to/directory/with/xml/with_tags/ /path/to/directory/with/xml/without_tags/
find /path/to/directory/with/xml -maxdepth 1 -mindepth 1 -type f | while read f; do
if grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f"; then
mv "$f" /path/to/directory/with/xml/with_tags/
else
mv "$f" /path/to/directory/with/xml/without_tags/
fi
done
Run this command as a dry run, then remove --dry-run to actually rename the files:
grep -Pl '(<ns2:apple>|<ns2:orange>|<ns2:melon>)' *.xml | xargs rename --dry-run 's/^/1_/'
The command-line utility rename comes in many flavors. Most of them should work for this task. I used the rename version 1.601 by Aristotle Pagaltzis. To install rename, simply download its Perl script and place into $PATH. Or install rename using conda, like so:
conda install rename
Here, grep uses the following options:
-P : Use Perl regexes.
-l : Suppress normal output; instead print the name of each input file from which output would normally have been printed.
SEE ALSO:
grep manual

Recursively compare specific files in different directories

Similar posts here:
Diff files present in two different directories
and here:
https://superuser.com/q/602877/520666
But not quite what I'm looking for.
I have 2 directories (containing subdirectories and different file types -- binary, images, html, etc.).
I want to be able to recursively compares files with specific extensions (e.g. .html, .strings, etc.) between the two directories -- they may or may not exist in either (sub)directory.
How can I accomplish this? Diff only seems to support exclusions, and I'm not sure how I can leverage Find for this.
Advice?
You could exclude all unwanted file endings with find:
(this version only matches against file endings)
diff -r -x `find . -type f -name '*.*' | sed 's|.*\.|.*\.|' | sort -u | grep -v YOURFILETYPE | paste -sd "|"` ...rest of diff command
Or you generate the list of excluded files upfront and pass it to the diff:
(this version also matches against filenames and every other regex you specify in include.file)
find /dirA -type f | grep -v YOURFILEENDING | sed 's|.*/||' > exclude.list
find /dirB -type f | grep -v YOURFILEENDING | sed 's|.*/||' >> exclude.list
diff -X exclude.list -r /dirA /dirB
If you chain these commands via && you'll get a handy oneliner ;)
WITH INCLUDE FILE
If you want to use an include file, you can use this Method:
You specify the include file
grep matches against all files in the folders and turns your includefile into an exclude file for diff (diff only takes exclude files)
Here is an example:
Complicated inline version:
(this version only matches against file endings)
diff -r -x `find . -type f -name '*.*' | sed 's|.*\.|.*\.|' | sort -u | grep -v -f include.file | paste -sd "|"` /dirA /dirB
Slightly longer simpler version:
(this version also matches against filenames and every other regex you specify in include.file)
find /dirA -type f | grep -v -f include.file | sed 's|.*/||' > exclude.list
find /dirB -type f | grep -v -f include.file | sed 's|.*/||' >> exclude.list
diff -X exclude.list -r /dirA /dirB
with each line in include.file being a grep regex/expression:
log
txt
fileending3
whateverfileendingyoulike
fullfilename.txt
someotherregex.*
NOTE
I did not run these because I'm nowhere near a computer.
I hope I got all syntax correct.
The simplest thing you can do is to compare the whole directories:
diff -r /path/the/first /path/the/second
It will show which files are only in one of the directories, which files differ in a binary fashion, and the full diff for any textual files in both directories.
You can loop over a set of relative paths by simply reading a file with a path per line thusly:
while IFS= read -r -u 9 relative_path
do
diff "/path/the/first/${relative_path}" "/path/the/second/${relative_path}"
done 9< relative_paths.txt
Doing this for a specific set of extensions is similarly easy:
shopt -s globstar nullglob
while IFS= read -r -u 9 extension
do
for file in /path/the/first/**/*."${extension}"
do
diff "$file" "/path/the/second/${file#/path/the/first/}"
done
done 9< extensions.txt
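For trees where a file may exist on only one side, a sketch along the following lines collects the matching relative paths from both directories before comparing. The name diff_by_ext and its argument order are assumptions made for this example, and file names are assumed to contain no newlines.

```shell
# Hypothetical helper: diff_by_ext DIR_A DIR_B EXT...
# Compares files with the given extensions that exist in either tree.
# Assumes file names contain no newlines.
diff_by_ext() {
    a=${1%/}
    b=${2%/}
    shift 2
    for ext in "$@"; do
        # Collect relative paths from both trees, without duplicates.
        {
            (cd "$a" && find . -type f -name "*.$ext")
            (cd "$b" && find . -type f -name "*.$ext")
        } | sort -u |
        while IFS= read -r rel; do
            if [ ! -f "$a/$rel" ]; then
                echo "Only in $b: $rel"
            elif [ ! -f "$b/$rel" ]; then
                echo "Only in $a: $rel"
            else
                # diff exits 1 when the files differ; that is not an error here.
                diff -q "$a/$rel" "$b/$rel" || true
            fi
        done
    done
}
```

For example, diff_by_ext /path/the/first /path/the/second html strings reports differing and one-sided .html and .strings files and ignores everything else.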

Shell: Copy list of files with full folder structure stripping N leading components from file names

Consider a list of files (e.g. files.txt) similar (but not limited) to
/root/
/root/lib/
/root/lib/dir1/
/root/lib/dir1/file1
/root/lib/dir1/file2
/root/lib/dir2/
...
How can I copy the specified files (not any other content from the folders which are also specified) to a location of my choice (e.g. ~/destination) with a) intact folder structure but b) N folder components (in the example just /root/) stripped from the path?
I already managed to use
cp --parents `cat files.txt` ~/destination
to copy the files with an intact folder structure, however this results in all files ending up in ~/destination/root/... when I'd like to have them in ~/destination/...
I think I found a really nice and concise solution by using GNU tar:
tar cf - -T files.txt | tar xf - -C ~/destination --strip-components=1
Note the --strip-components option that allows to remove an arbitrary number of path components from the beginning of the file name.
One minor problem though: It seems tar always "compresses" the whole content of folders mentioned in files.txt (at least I couldn't find an option to ignore folders), but that is most easily solved using grep:
cat files.txt | grep -v '/$' > files2.txt
This might not be the most graceful solution - but it works:
for file in $(cat files.txt); do
echo "checking for $file"
if [[ -f "$file" ]]; then
file_folder=$(dirname "$file")
destination_folder=/destination/${file_folder#/root/}
echo "copying file $file to $destination_folder"
mkdir -p "$destination_folder"
cp "$file" "$destination_folder"
fi
done
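The loop above hard-codes /root/ as the prefix to remove. A generalisation that drops the first N path components might look like this; copy_stripped and its argument order are invented for the sketch, and it assumes one path per line with no newlines in names.

```shell
# Hypothetical helper: copy_stripped LIST DEST N
# Copies the regular files named in LIST below DEST, dropping the first N
# path components. Assumes one path per line and no newlines in names.
copy_stripped() {
    list=$1
    dest=${2%/}
    n=$3
    while IFS= read -r path; do
        # Directory entries in the list are skipped; only files are copied.
        [ -f "$path" ] || continue
        rel=${path#/}
        i=0
        while [ "$i" -lt "$n" ]; do
            rel=${rel#*/}
            i=$((i + 1))
        done
        mkdir -p "$dest/$(dirname "$rel")"
        cp "$path" "$dest/$rel"
    done < "$list"
}
```

With the list from the question, copy_stripped files.txt ~/destination 1 would place /root/lib/dir1/file1 at ~/destination/lib/dir1/file1.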
I had a look at cp and rsync, but it looks like they would work better if you were to cd into /root first.
However, if you did cd to the correct directory beforehand, you could always run it in a subshell so that you are returned to your original location once the subshell has finished.

Can I limit the recursion when copying using find (bash)

I have been given a list of folders which need to be found and copied to a new location.
I have basic knowledge of bash and have created a script to find and copy.
The basic command I am using is working, to a certain degree:
find ./ -iname "*searchString*" -type d -maxdepth 1 -exec cp -r {} /newPath/ \;
The problem I want to resolve is that each found folder contains the files that I want, but also contains subfolders which I do not want.
Is there any way to limit the recursion so that only the files at the root level of the found folder are copied: all subdirectories and files therein should be ignored.
Thanks in advance.
If you remove -R, cp doesn't copy directories:
cp *searchstring*/* /newpath
The command above copies dir1/file1 to /newpath/file1, but these commands copy it to /newpath/dir1/file1:
cp --parents *searchstring*/*(.) /newpath
for GNU cp and zsh
. is a qualifier for regular files in zsh
cp --parents dir1/file1 dir2 copies file1 to dir2/dir1 in GNU cp
t=/newpath;for d in *searchstring*/;do mkdir -p "$t/$d";cp "$d"* "$t/$d";done
find *searchstring*/ -type f -maxdepth 1 -exec rsync -R {} /newpath \;
-R (--relative) is like --parents in GNU cp
find . -ipath '*searchstring*/*' -type f -maxdepth 2 -exec ditto {} /newpath/{} \;
ditto is only available on OS X
ditto file dir/file creates dir if it doesn't exist
So ... you've been given a list of folders. Perhaps in a text file? You haven't provided an example, but you've said in comments that there will be no name collisions.
One option would be to use rsync, which is available as an add-on package for most versions of Unix and Linux. Rsync is basically an advanced copying tool -- you provide it with one or more sources, and a destination, and it makes sure things are synchronized. It knows how to copy things recursively, but it can't be told to limit its recursion to a particular depth, so the following will copy each item specified to your target, but it will do so recursively.
xargs -L 1 -J % rsync -vi -a % /path/to/target/ < sourcelist.txt
If sourcelist.txt contains a line with /foo/bar/slurm, then the slurm directory will be copied in its entirety to /path/to/target/slurm/. But this would include directories contained within slurm.
This will work in pretty much any shell, not just bash. But it will fail if one of the lines in sourcelist.txt contains whitespace, or various special characters. So it's important to make sure that your sources (on the command line or in sourcelist.txt) are formatted correctly. Also, rsync has different behaviour if a source directory includes a trailing slash, and you should read the man page and decide which behaviour you want.
You can sanitize your input file fairly easily in sh, or bash. For example:
#!/bin/sh
# Avoid commented lines...
grep -v '^[[:space:]]*#' sourcelist.txt | while read line; do
# Remove any trailing slash, just in case
source=${line%%/}
# make sure source exist before we try to copy it
if [ -d "$source" ]; then
rsync -vi -a "$source" /path/to/target/
fi
done
But this still uses rsync's -a option, which copies things recursively.
I don't see a way to do this using rsync alone. Rsync has no equivalent of find's -maxdepth option. But I can see doing this in two passes -- once to copy all the directories, and once to copy the files from each directory.
So I'll make up an example, and assume further that folder names do not contain special characters like spaces or newlines. (This is important.)
First, let's do a single-pass copy of all the directories themselves, not recursing into them:
xargs -L 1 -J % rsync -vi -d % /path/to/target/ < sourcelist.txt
The -d option creates the directories that were specified in sourcelist.txt, if they exist.
Second, let's walk through the list of sources, copying each one:
# Basic sanity checking on input...
grep -v '^[[:space:]]*#' sourcelist.txt | while read line; do
if [ -d "$line" ]; then
# Strip trailing slashes, as before
source=${line%%/}
# Grab the directory name from the source path
target=${source##*/}
# Exclude subdirectories so that only the top-level files are copied
rsync -vi -a --exclude='*/' "$source/" "/path/to/target/$target/"
fi
done
Note the trailing slash after $source on the rsync line. This causes rsync to copy the contents of the directory, rather than the directory.
Does all this make sense? Does it match your requirements?
You can use find's ipath argument:
find . -maxdepth 2 -ipath './*searchString*/*' -type f -exec cp '{}' '/newPath/' ';'
Notice the path starts with ./ to match find's search directory, ends with /* in order to exclude files in the top level directory, and maxdepth is set to 2 to only recurse one level deep.
Edit:
Re-reading your comments, it seems like you want to preserve the directory you're copying from? E.g. when searching for foo*:
./foo1/* ---> copied to /newPath/foo1/* (not to /newPath/*)
./foo2/* ---> copied to /newPath/foo2/* (not to /newPath/*)
Also, the other requirement is to keep maxdepth at 1 for speed reasons.
(As pointed out in the comments, the following solution has security issues for specially crafted names)
Combining both, you could use this:
find . -maxdepth 1 -type d -iname '*searchString*' -exec sh -c "mkdir -p '/newPath/{}'; cp "{}/*" '/newPath/{}/' 2>/dev/null" ';'
Edit 2:
Why not ditch find altogether and use a pure bash solution:
for d in *searchString*/; do mkdir -p "/newPath/$d"; cp "$d"* "/newPath/$d"; done
Note the / at the end of the search string, causing only directories to be considered for matching.
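The per-directory step of that loop can also be written so that subdirectories are skipped explicitly, instead of relying on cp refusing to copy them. A small sketch; copy_top_level is a hypothetical name, not an existing tool:

```shell
# Hypothetical helper: copy_top_level SRC_DIR DEST
# Copies only the regular files directly inside SRC_DIR to DEST/<name of SRC_DIR>.
# Files whose names start with a dot are not matched by the * glob.
copy_top_level() {
    src=${1%/}
    target=${2%/}/$(basename "$src")
    mkdir -p "$target"
    for f in "$src"/*; do
        if [ -f "$f" ]; then
            cp "$f" "$target/"
        fi
    done
}
```

It could be driven by the same glob: for d in *searchString*/; do copy_top_level "$d" /newPath; done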

Unix Shell scripting for copying files and creating directory

I have a source directory eg /my/source/directory/ and a destination directory eg /my/dest/directory/, which I want to mirror with some constraints.
I want to copy files which meet certain criteria of the find command, eg -ctime -2 (less than 2 days old) to the dest directory to mirror it
I want to include some of the prefix so I know where it came from, eg /source/directory
I'd like to do all this with absolute paths so it doesn't depend which directory I run from
I'd guess not having cd commands is good practice too.
I want the subdirectories created if they don't exist
So
/my/source/directory/1/foo.txt -> /my/dest/directory/source/directory/1/foo.txt
/my/source/directory/2/3/bar.txt -> /my/dest/directory/source/directory/2/3/bar.txt
I've hacked together the following command line but it seems a bit ugly, can anyone do better?
find /my/source/directory -ctime -2 -type f -printf "%P\n" | xargs -IFILE rsync -avR /my/./source/directory/FILE /my/dest/directory/
Please comment if you think I should add this command line as an answer myself, I didn't want to be greedy for reputation.
This is remarkably similar to a (closed) question: Bash scripting copying files without overwriting. The answer I gave cites the 'find | cpio' solution mentioned in other answers (minus the time criteria, but that's the difference between 'similar' and 'same'), and also outlines a solution using GNU 'tar'.
ctime
When I tested on Solaris, neither GNU tar nor (Solaris) cpio was able to preserve the ctime setting; indeed, I'm not sure that there is any way to do that. For example, the touch command can set the atime or the mtime or both - but not the ctime. The utime() system call also only takes the mtime or atime values; it does not handle ctime. So, I believe that if you find a solution that preserves ctime, that solution is likely to be platform-specific. (Weird example: hack the disk device and edit the data in the inode - not portable, requires elevated privileges.) Rereading the question, though, I see that 'preserving ctime' is not part of the requirements (phew); it is simply the criterion for whether the file is copied or not.
chdir
I think that the 'cd' operations are necessary - but they can be wholly localized to the script or command line, though, as illustrated in the question cited and the command lines below, the second of which assumes GNU tar.
(cd /my; find source/directory -ctime -2 | cpio -pvdm /my/dest/directory)
(cd /my; find source/directory -ctime -2 | tar -cf - -T - ) |
(cd /my/dest/directory; tar -xf -)
Without using chdir() (aka cd), you need specialized tools or options to handle the manipulation of the pathnames on the fly.
Names with blanks, newlines, etc
The GNU-specific 'find -print0' and 'xargs -0' are very powerful and effective, as noted by Adam Hawes. Funnily enough, GNU cpio has an option to handle the output from 'find -print0', and that is '--null' or its short form '-0'. So, using GNU find and GNU cpio, the safe command is:
(cd /my; find source/directory -ctime -2 -print0 |
cpio -pvdm0 /my/dest/directory)
Note: This does not overwrite pre-existing files under the backup directory. Add -u to the cpio command for that.
Similarly, GNU tar supports --null (apparently with no -0 short-form), and could also be used:
(cd /my; find source/directory -ctime -2 -print0 | tar -cf - --null -T - ) |
(cd /my/dest/directory; tar -xf -)
The GNU handling of file names with the null terminator is extremely clever and a valuable innovation (though I only became aware of it fairly recently, courtesy of SO; it has been in GNU tar for at least a decade).
You could try cpio using the copy-pass mode, -p. I usually use it with overwrite all (-u), create directories (-d), and maintain modification time (-m).
find myfiles | cpio -pmud target-dir
Keep in mind that find should produce relative path names, which doesn't fit your absolute path criteria. This could of course be 'solved' using cd, which you also don't like (why not?)
(cd mypath; find myfiles | cpio ... )
The parentheses will spawn a subshell, and will keep the state change (i.e. the directory switch) local. You could also define a small procedure to abstract away the 'ugliness'.
If you're using find, always use -print0 and pipe the output through xargs -0; well, almost always. The first file with a space in its name will bork the script if you use the default newline-terminated output of find.
I agree with all the other posters - use cpio or tar if you can. It'll do what you want and save the hassle.
An alternative is to use tar,
(cd $SOURCE; tar cf - .) | (cd $DESTINATION; tar xf -)
EDIT:
Ah, I missed the bit about preserving CTIME. I believe most implementations of tar will preserve mtime, but if preserving ctime is critical, then cpio is indeed the only way.
Also, some tar implementations (GNU tar being one) can select the files to include based on atime and mtime, though seemingly not ctime.
#!/bin/sh
SRC=/my/source/directory
DST=/my/dest/directory
for i in $(find $SRC -ctime -2 -type f) ; do
SUBDST=$DST$(dirname $i)
mkdir -p $SUBDST
cp -p $i $SUBDST
done
And I suppose, since you want to include "where it came from", that you are going to use different source directories. This script can be modified to take source dir as an argument simply by replacing SRC=/my/source/directory, with SRC=$1
EDIT: Removed redundant if statement.
Does not work when filenames have whitespaces.
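A whitespace-safe variant of the loop above can hand the file names to a small inline script through find -exec, which avoids both word splitting and cd. This is a sketch under assumptions: mirror_recent and its four arguments are made up here, and the prefix to keep is passed as a path relative to the base directory.

```shell
# Hypothetical helper: mirror_recent BASE SUBDIR DEST DAYS
# Copies files under BASE/SUBDIR whose ctime is less than DAYS days old to
# DEST, keeping SUBDIR as a prefix. File names are passed as arguments, so
# spaces and newlines in names are handled.
mirror_recent() {
    base=${1%/}
    sub=$2
    dest=${3%/}
    days=$4
    find "$base/$sub" -type f -ctime "-$days" -exec sh -c '
        base=$1
        dest=$2
        shift 2
        for f do
            rel=${f#"$base"/}
            mkdir -p "$dest/$(dirname "$rel")" && cp -p "$f" "$dest/$rel"
        done' sh "$base" "$dest" {} +
}
```

For the paths in the question the call would be mirror_recent /my source/directory /my/dest/directory 2.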
#!/usr/bin/sh
# script to copy files with same directory structure
echo "Please enter Full Path of Source DIR (Starting with / and ending with /):"
read spath
echo " Please enter Full Path of Destination location (Starting with / and ending with /):"
read dpath
si=$(echo "$spath" | awk -F/ '{print NF-1}')
for fname in $(find "$spath" -type f -print)
do
cdir=$(echo "$fname" | awk -F/ '{ for (i='$si'; i<NF; i++) printf "%s/", $i; printf "\n"; }')
if [ "$cdir" ]; then
if [ ! -d "$dpath$cdir" ]; then
mkdir -p "$dpath$cdir"
fi
fi
cp "$fname" "$dpath$cdir"
done
