Recursive directory-diff accepting external diff - shell

Consider the task of finding what has changed in two projects which were forked.
diff -r gets me close: it is capable of finding which files are missing in each target folder, and points out files which have changed by presenting their diffs.
I want to use a custom diff utility, and I would rather not have to implement the recursive directory-walking logic in my diff utility.
So I basically just want a program that does what diff -r does, but without actually running the diffs itself.
Does such a thing exist?

I figured the output of diff -r is already plenty structured enough for me to "get clever" with it. Here's the gist of it.
diff -r proj1/src proj2/src | grep 'diff -r' | cut -d ' ' -f 3,4 | xargs -n 2 sift
where sift is my little command-line char-based diff util which runs circles around diff's diff output.
and using diff (GNU diffutils) 2.8.1
I am open to more elegant solutions as well!
Edit: Thanks @janos, the -q option makes it pretty optimal!
One last thing to mention: this can be made quite powerful by piping into the opendiff program on a Mac's command line (specifying the corresponding file in the desired dir as the merge target, which of course is already inside a Git repo, right?) to do the manual merging nice and quickly.
In fact, setting up opendiff to be used by Git when it needs a human merge is probably the way to go.
It's just that I still have not encountered very many merge-conflict situations within the same repo; it is mainly when forking repos (and keeping separate repos for divergent projects that contain shared code) that I need this kind of merge, to manually bring my "primary" projects up to date with the changes made in the trenches.
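For reference, a minimal sketch of that setup (Git ships a built-in mergetool entry for opendiff, so no extra configuration file should be needed):
git config --global merge.tool opendiff
# later, when a merge stops on conflicts:
git mergetool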

I think this is more elegant:
diff -rq dir1 dir2 | sed -ne 's/^Files \(.*\) and \(.*\) differ$/\1\n\2/p' | xargs -n 2 sift
The main trick is using -q flag, which will print the differences in brief format, for example:
Files dir1/x and dir2/x differ
Only in dir1: path/to/file
Only in dir2: path/to/another
And then how you parse the output is a matter of taste.
Finally, to correctly handle spaces in the file names:
diff -rq dir1 dir2 | sed -ne 's/^Files \(.*\) and \(.*\) differ$/sift "\1" "\2"/p' | sh
It probably makes sense to wrap this in a function:
xdiff() {
    diff=$1; shift
    dir1=$1; shift
    dir2=$1; shift
    diff -rq "$dir1" "$dir2" | sed -ne 's/^Files \(.*\) and \(.*\) differ$/'"$diff"' "\1" "\2"/p' | sh
}
So you can call it like this:
xdiff sift dir1 dir2

Related

How do I find duplicate files by comparing them by size (ie: not hashing) in bash

Testbed files:
-rw-r--r-- 1 usern users 68239 May 3 12:29 The W.pdf
-rw-r--r-- 1 usern users 68239 May 3 12:29 W.pdf
-rw-r--r-- 1 usern users 8 May 3 13:43 X.pdf
Yes, files can have spaces (Boo!).
I want to check files in the same directory and move the ones which match another file into a 'these are probably duplicates' folder.
My probable use-case is going to have humans randomly mis-naming a smaller set of files (ie: not generating files of arbitrary length). It is fairly unlikely that two files will be the same size and yet be different files. Sure, as a backup I could hash and check two files of identical size. But mostly, it will be people taking a file and misnaming it / re-adding it to a pile it is already in.
So, preferably a solution with widely installed tools (posix?). And I'm not supposed to parse the output of ls, so I need another way to get actual size (and not a du approximate).
"Vote to close!"
Hold up cowboy.
I bet you're going to suggest this (cool, you can google search):
https://unix.stackexchange.com/questions/71176/find-duplicate-files
No fdupes (nor jdupes, nor...), nor finddup, nor rmlint, nor fslint - I can't guarantee those on other systems (much less mine), and I don't want to be stuck as customer support dealing with installing them on random systems from now to eternity, nor even in getting emails about that sh...stuff and having to tell them to RTFM and figure it out. Plus, in reality, I should write my script to test functionality of what is installed, but, that's beyond the scope.
https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash
All these solutions want to start by hashing. Some cool ideas in some of these: hash just a chunk of both files, starting somewhere past the header, then only do a full compare if those turn up matching. Good idea for double-checking the work, but I would prefer to only do that on the very, very few that actually are duplicates. Looking over the first several thousand of these by hand, not one duplicate has been even close in size to a different file.
https://unix.stackexchange.com/questions/277697/whats-the-quickest-way-to-find-duplicated-files
Proposed:
$find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breaks for me:
find: unknown option -- n
usage: find [-dHhLXx] [-f path] path ... [expression]
uniq: unknown option -- w
usage: uniq [-ci] [-d | -u] [-f fields] [-s chars] [input_file [output_file]]
find: unknown option -- t
usage: find [-dHhLXx] [-f path] path ... [expression]
xargs: md5sum: No such file or directory
https://unix.stackexchange.com/questions/170693/compare-directory-trees-regarding-file-name-and-size-and-date
Haven't been able to figure out how rsync -nrvc --delete might work in the same directory, but there might be a solution in there.
Well how about cmp? Yeah, that looks pretty good, actually!
cmp -z file1 file2
Bummer, my version of cmp does not include the -z size option.
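As an aside, a size-only check built from nothing but POSIX tools could compare byte counts with wc -c (just a sketch; file1 and file2 are placeholders):
# do the two files have the same byte count?
if [ "$(wc -c < file1)" -eq "$(wc -c < file2)" ]; then
    echo "same size - candidate duplicates"
fi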
However, I tried implementing it just for grins - and when it failed, looking at it I realized that I also need help constructing my loop logic. Removing things from my loops in the midst of processing them is probably a recipe for breakage, duh.
if [ ! -d ../Dupes/ ]; then
    mkdir ../Dupes/ || exit 1 # Cuz no set -e, and trap not working
fi
for i in ./*
do
    for j in ./*
    do
        if [[ "$i" != "$j" ]]; then # Yes, it will be identical to itself
            if [[ $(cmp -s "$i" "$j") ]]; then
                echo "null" # Cuz I can't use negative of the comparison?
            else
                mv -i "$i" ../Dupes/
            fi
        fi
    done
done
https://unix.stackexchange.com/questions/367749/how-to-find-and-delete-duplicate-files-within-the-same-directory
Might have something I could use, but I'm not following what's going on in there.
https://superuser.com/questions/259148/bash-find-duplicate-files-mac-linux-compatible
If it were something that returns size, instead of md5, maybe one of the answers in here?
https://unix.stackexchange.com/questions/570305/what-is-the-most-efficient-way-to-find-duplicate-files
Didn't really get answered.
TIL: Sending errors from . scriptname will close my terminal instantly. Thanks, Google!
TIL: Sending errors from scripts executed via $PATH will close the terminal if shopt -s extdebug + trap checkcommand DEBUG are set in profile to try and catch rm -r * - but at least will respect my alias for exit
TIL: Backticks deprecated, use $(things) - Ugh, so much re-writing to do :P
TIL: How to catch non-ascii characters in filenames, without using basename
TIL: "${file##*/}"
TIL: file - yes, X.pdf is not a PDF.
On the matter of POSIX
I'm afraid you cannot get the actual file size (not the number of blocks allocated by the file) in a plain posix shell without using ls. All the solutions like du --apparent-size, find -printf %s, and stat are not posix.
However, as long as your filenames don't contain linebreaks (spaces are ok) you could create safe solutions relying on ls. Correctly handling filenames with linebreaks would require very non-posix tools (like GNU sort -z) anyway.
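For illustration, a sketch of that kind of ls-based size lookup for a single file (field 5 of ls -ln; -n keeps owner and group numeric so the size column stays in a fixed position):
# sketch: byte size of one file, safe for names with spaces but not linebreaks;
# "$f" is a placeholder for the filename
size=$(ls -ln -- "$f" | awk '{print $5}')
printf '%s\t%s\n' "$size" "$f"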
Bash+POSIX Approach Actually Comparing The Files
I would drop the approach of comparing only the file sizes and use cmp instead. For huge directories the posix script will be slow no matter what you do. Also, I expect cmp to do some fail-fast checks (like comparing the file sizes) before actually comparing the file contents. For common scenarios with only a few files, speed shouldn't matter anyway, as even the worst script will run fast enough.
The following script places each group of actual duplicates (at least two, but can be more) into its own subdirectory of dups/. The script should work with all filenames; spaces, special symbols, and even linebreaks are ok. Note that we are still using bash (which is not posix). We just assume that all tools (like mv, find, ...) are posix.
#! /usr/bin/env bash
files=()
for f in *; do [ -f "$f" ] && files+=("$f"); done
max=${#files[@]}
for (( i = 0; i < max; i++ )); do
    sameAsFileI=()
    for (( j = i + 1; j < max; j++ )); do
        cmp -s "${files[i]}" "${files[j]}" &&
            sameAsFileI+=("${files[j]}") &&
            unset 'files[j]'
    done
    (( ${#sameAsFileI[@]} == 0 )) && continue
    mkdir -p "dups/$i/"
    mv "${files[i]}" "${sameAsFileI[@]}" "dups/$i/"
    # no need to unset files[i] because loops won't visit this entry again
    files=("${files[@]}") # un-sparsify array
    max=${#files[@]}
done
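If it helps, one possible way to run it (the script name dedupe.sh is just an assumed placeholder):
cd /path/to/the/pile   # the directory holding the suspect files
bash dedupe.sh         # groups of identical files end up under dups/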
Fairly Portable Non-POSIX Approach Using File Sizes Only
If you need a faster approach that only compares the file sizes, I suggest not using a nested loop. Loops in bash are slow already, and if you nest them you get quadratic time complexity. It is faster and easier to ...
print only the file sizes without file names
apply sort | uniq -d to retrieve duplicates in time O(n log n)
Move all files having one of the duplicated sizes to a directory
This solution is not strictly posix conformant. However, I tried to verify that the tools and options in this solution are supported by most implementations. Your find has to support the non-posix options -maxdepth and -printf with %s for the actual file size and %f for the file basename (%p for the full path would be acceptable too).
The following script places all files of the same size into the directory potential-dups/. If there are two files of size n and two files of size m, all four files end up in this single directory. The script should work with all file names except those with linebreaks (that is \n; \r should be fine though).
#! /usr/bin/env sh
all=$(find . -maxdepth 1 -type f -printf '%s %f\n' | sort)
dupRegex=$(printf %s\\n "$all" | cut -d' ' -f1 | uniq -d |
    sed -e 's/[][\.|$(){}?+*^]/\\&/g' -e 's/^/^/' | tr '\n' '|' | sed 's/|$//')
[ -z "$dupRegex" ] && exit
mkdir -p potential-dups
printf %s\\n "$all" | grep -E "$dupRegex" | cut -d' ' -f2- |
    sed 's/./\\&/' | xargs -I_ mv _ potential-dups
In case you wonder about some of the sed commands: They quote the file names such that spaces and special symbols are processed correctly by subsequent tools. sed 's/[][\.|$(){}?+*^]/\\&/g' is for turning raw strings into equivalent extended regular expressions (ERE) and sed 's/./\\&/' is for literal processing by xargs. See the posix documentation of xargs:
-I replstr [...] Any <blank>s at the beginning of each line shall be ignored.
[...]
Note that the quoting rules used by xargs are not the same as in the shell. [...] An easy rule that can be used to transform any string into a quoted form that xargs interprets correctly is to precede each character in the string with a backslash.

How can I use Git to identify function changes across different revisions of a repository?

I have a repository with a bunch of C files. Given the SHA hashes of two commits,
<commit-sha-1> and <commit-sha-2>,
I'd like to write a script (probably bash/ruby/python) that detects which functions in the C files in the repository have changed across these two commits.
I'm currently looking at the documentation for git log, git commit and git diff. If anyone has done something similar before, could you give me some pointers about where to start or how to proceed?
That doesn't look too good, but you could combine git with your favorite tagging system, such as GNU global, to achieve that. For example:
#!/usr/bin/env sh
global -f main.c | awk '{print $NF}' | cut -d '(' -f1 | while read -r i
do
    if [ "$(git log -L:"$i":main.c HEAD^..HEAD | wc -l)" -gt 0 ]
    then
        printf "%s() changed\n" "$i"
    else
        printf "%s() did not change\n" "$i"
    fi
done
First, you need to create a database of functions in your project:
$ gtags .
Then run the above script to find functions in main.c that were modified since the last commit. The script could of course be more flexible; for example, it could handle all *.c files changed between 2 commits as reported by git diff --stat, as in the sketch below.
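A rough sketch of that variant (an elaboration under the same assumptions: the gtags database covers the repo, and the two commit SHAs are passed as arguments):
#!/usr/bin/env sh
old=$1   # e.g. <commit-sha-1>
new=$2   # e.g. <commit-sha-2>
# for every C file changed between the two commits...
git diff --name-only "$old" "$new" -- '*.c' | while read -r f
do
    # ...list its functions (via GNU global) and check each one's history in that range
    global -f "$f" | awk '{print $NF}' | cut -d '(' -f1 | while read -r func
    do
        if [ "$(git log -L:"$func":"$f" "$old..$new" 2>/dev/null | wc -l)" -gt 0 ]
        then
            printf "%s: %s() changed\n" "$f" "$func"
        fi
    done
done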
Inside the script we use the -L option of git log:
-L <start>,<end>:<file>, -L :<funcname>:<file>
Trace the evolution of the line range given by
"<start>,<end>" (or the function name regex <funcname>)
within the <file>. You may not give any pathspec
limiters. This is currently limited to a walk starting from
a single revision, i.e., you may only give zero or one
positive revision arguments. You can specify this option
more than once.
Bash script:
#!/usr/bin/env bash
git diff | \
grep -E '^(##)' | \
grep '(' | \
sed 's/##.*##//' | \
sed 's/(.*//' | \
sed 's/\*//' | \
awk '{print $NF}' | \
uniq
Explanation:
1: Get diff
2: Get only lines with hunk headers; if the 'optional section heading' of a hunk header exists, it will be the function definition of a modified function
3: Pick only hunk headers containing open parentheses, as they will contain function definitions
4: Get rid of the '@@ [old-file-range] [new-file-range] @@' sections in the lines
5: Get rid of everything after opening parentheses
6: Get rid of '*' from pointers
7: [See 'awk']: Print the last field (i.e: column) of the records (i.e: lines).
8: Get rid of duplicate names.
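To run this against the two commits from the question instead of the uncommitted working tree, the first command would presumably become:
git diff <commit-sha-1> <commit-sha-2>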

find - grep taking too much time

First of all, I'm a newbie with bash scripting, so forgive me if I'm making easy mistakes.
Here's my problem. I needed to download my company's website. I accomplished this using wget with no problems, but because some files have the ? symbol and Windows doesn't like filenames with ?, I had to create a script that renames the files and also updates the source code of all files that reference the renamed file.
To accomplish this I use the following code:
find . -type f -name '*\?*' | while read -r file ; do
SUBSTRING=$(echo $file | rev | cut -d/ -f1 | rev)
NEWSTRING=$(echo $SUBSTRING | sed 's/?/-/g')
mv "$file" "${file//\?/-}"
grep -rl "$SUBSTRING" * | xargs sed -i '' "s/$SUBSTRING/$NEWSTRING/g"
done
This is having 2 problems:
1: It is taking way too long; I've waited more than 5 hours and it is still going.
2: It looks like it is appending in the source code, because when I stop the script and search for changes, the URL is repeated like 4 times (or more).
Thanks all for your comments. I will try the 2 separate steps and see. Also, just as an FYI, there are 3291 files that were downloaded with wget. Do you still think that bash scripting is preferable over other tools for this?
It seems odd that a file would have ? in it. Website URLs use ? to indicate the passing of parameters, and wget from a website doesn't guarantee you're getting the site as it exists on disk, especially if server-side execution takes place, as with PHP files. So I suspect that as wget does its recursive walk, it's finding URLs that pass parameters and thus creating those files for you.
To really get the site, you should have direct access to the files.
If I were you, I'd start over and not use wget.
You may also be having issues with files or directories with spaces in their names.
Instead of that line with xargs: you're already doing one file at a time, but then grepping for it through everything recursively. Just do the sed on the new file itself.
Ok, here's the idea (untested):
in the first loop, just move the files and compose a global sed replacement file
once that is done, scan all the files and apply sed with all the patterns at once, thus saving a lot of read/write operations, which are likely to be the cause of the performance issue here
I would avoid putting the current script in the current directory or it will be processed by sed, so I assume that all files to be processed are not in the current dir but in a data directory
code:
sedfile=/tmp/tmp.sed
data=data
rm -f "$sedfile"
# locate ourselves in the subdir to preserve the naming logic
cd "$data"
# rename the files and compose the big sedfile
find . -type f -name '*\?*' | while read -r file ; do
    SUBSTRING=$(echo "$file" | rev | cut -d/ -f1 | rev)
    NEWSTRING=$(echo "$SUBSTRING" | sed 's/?/-/g')
    mv "$file" "${file//\?/-}"
    echo "s/$SUBSTRING/$NEWSTRING/g" >> "$sedfile"
done
# now apply the big sedfile once on all the files:
# if you need to go recursive:
find . -type f | xargs sed -i -f "$sedfile"
# if you don't:
sed -i -f "$sedfile" *
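For example (with a made-up filename), after renaming index.html?p=2 the generated sedfile would contain a line like:
s/index.html?p=2/index.html-p=2/g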
Instead of using grep, you can use the find command or ls command to list the files and then operate directly on them.
For example, you could do:
ls -1 /path/to/files/* | xargs sed -i '' "s/$SUBSTRING/$NEWSTRING/g"
Here's where I got the idea based on another question where grep took too long:
Linux - How to find files changed in last 12 hours without find command

Diff files in two folders ignoring the first line

I have two folders of files that I want to diff, except I want to ignore the first line in all the files. I tried
diff -Nr <(tail -n +1 folder1/) <(tail -n +1 folder2/)
but that clearly isn't the right way.
If the first lines that you want to ignore have a distinctive format that can be matched by a POSIX regular expression, then you can use diff's --ignore-matching-lines=... option to tell it to ignore those lines.
Failing that, the approach you want to take probably depends on your exact requirements. You say you "want to diff" the files, but it's not obvious exactly how faithfully your resulting output needs to match what you would get from diff -Nr if it supported that feature. (For example, do you need the line numbers in the diff to correctly identify the line numbers in the original files?)
The most precisely faithful approach would probably be as follows:
Copy each directory to a fresh location, using cp --recursive ....
Edit the first line of each file to prepend a magic string like IGNORE_THIS_LINE::, using something like find -type f -exec sed -i '1 s/^/IGNORE_THIS_LINE::/' '{}' ';'.
Use diff -Nr --ignore-matching-lines=^IGNORE_THIS_LINE:: ... to compare the results.
Pipe the output to sed s/IGNORE_THIS_LINE:://, so as to filter out any occurrences of IGNORE_THIS_LINE:: that still show up (due to being within a few lines of non-ignored differences).
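Putting those four steps together, a rough sketch (dir1, dir2, and the scratch copies under /tmp are assumptions) could look like this:
cp --recursive dir1 /tmp/dir1.cmp
cp --recursive dir2 /tmp/dir2.cmp
# prepend the magic string to the first line of every file in the copies
find /tmp/dir1.cmp /tmp/dir2.cmp -type f -exec sed -i '1 s/^/IGNORE_THIS_LINE::/' '{}' ';'
# diff the copies, ignoring the tagged first lines, then strip any leftover tags
diff -Nr --ignore-matching-lines='^IGNORE_THIS_LINE::' /tmp/dir1.cmp /tmp/dir2.cmp | sed 's/IGNORE_THIS_LINE:://'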
Using process substitution is the correct way to create intermediate input file descriptors, but tail doesn't work on folders. Just iterate over all the files in the folder:
for f in folder1/*.txt; do
    tail -n +2 "$f" | diff - <(tail -n +2 "folder2/$(basename "$f")")
done
Note I used +2 instead of +1: tail line numbering starts at line 1, not 0.

Shell script to compare directories recursively

I have a months-old backup of a file server on an external hard drive; the file server itself has since gone down.
Most of the data was recovered onto a temporary file server that's been in use since then, but there are some inconsistencies.
I am going to mount the external drive and rsync the current data to it, but first I need to establish which files have been updated on the newer copy.
I can do diff -r -q /old/ /new/ to obtain this. I am trying to get better at scripting, so I am trying to write something that will rename the old file to filename.old whenever diff reports a difference.
So after checking, I wasn't able to find an option in diff to only output the filename differences so we'll just work with what diff outputs.
If diff finds files that differ, the output is something like this:
Files old/file and new/file differ
Since all your bash script would be doing is renaming the changed file from the old directory, we want to extract old/file from this output. Let's start by only displaying lines like Files...differ (as other lines may be produced):
diff -rq old/ new/ | grep "^Files.*differ$"
Now you'll only get lines like the one shown before. The next step is getting the filename. You could do this with awk by adding something like awk '{print $2}' as another pipe, but if the filename itself contains spaces, awk will break it up into two separate strings. We'll use sed instead:
diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/'
Now this will produce a list of files that have changed in the old directory. Using a simple for loop, you can now rename each of the files:
for old_file in `diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/'`
do
    mv $old_file $old_file.old
done
And that's it!
edit: actually, that's not all. This loop still fails on files with spaces, so let's muck with it a bit. By default, for splits the list on spaces. Let's change this to split on newlines instead:
OLD_IFS=$IFS
# split only on newlines
IFS=$'\n'
for old_file in `diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/'`
do
    mv $old_file $old_file.old
done
IFS=$OLD_IFS
This temporarily changes bash's field separator ($IFS) to a newline and puts it back after the loop is done, so the filenames are not split on spaces.
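An alternative sketch that sidesteps the $IFS juggling entirely (a variation on the same pipeline, not part of the answer above) is to feed the file list into a while read loop:
diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/' |
while IFS= read -r old_file
do
    mv "$old_file" "$old_file.old"
done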
There used to be a program dircmp lurking around systems. If you have it, use it.
If you don't have and can't find it, then I have a version of it with minor extensions that you can use (see my profile for contact information).
Example output:
Files in ifxchkpath-4.12 only and in ifxchkpath-5.15 only
./Makefile            ./absname.1
./program.mk          ./absname.c
                      ./chk.servers.sh
                      ./chunklist.sh
                      ./clnpath.c
                      ./clnpath.h
                      ./dirname.c
                      ./enable.uids.sh
                      ./errhelp.c
                      [...]
                      ./lpt.pl
                      ./makefile
                      ./nvstrcpy.c
                      ./realpath.c
                      ./realpathtest.sh
                      ./stderr.c
                      ./stderr.h
                      ./symlinks.tgz
                      ./testids.mk
                      ./test.linkpath.pl
                      ./test.lpt.sh
                      ./test.onsecurity.sh
                      ./tokenise.c
Comparison of files in ifxchkpath-4.12 and ifxchkpath-5.15
directory .
different ./chk.ifxchkpath.sh
different ./ifxchkpath.c
same ./test.ifxchkpath.pl

Resources