diff between folders whilst ignoring filename changes - bash

How can I use diff in the terminal but ignore changes in file names?
Currently this is what I'm doing:
diff -wrN folder1 folder2 | grep '^>' | wc -l
How can I do a git diff between two commit ids whilst:
ignoring file renames
only looking at java files
ignoring specific folder names, e.g. folders 'a' and 'b'
performing the grep '^>' | wc -l

You seem unaware of the hardness of this problem, so I'd like to point out why this is so difficult.
Take two directories which are equal in the beginning and both contain, say, 1000 files. Now you rename, say, 500 files in one of the directories. Renamings can vary greatly. A file originally called foobar.txt can be named DSC-3457.orig.jpg afterwards. The diff command cannot find it again without having any idea about what has been renamed into what.
Additionally, a file called x could be renamed to y, while a file called y could be renamed to x. In this case it is even questionable whether this should be regarded as a mere renaming or whether both files' contents have simply been exchanged.
This all means that in general you will have great difficulty accomplishing this. Standard tools will not do this out of the box.
This said, I have two aspects I want to point out which might help you.
File Sizes
You can sort all files by their file sizes and then diff each pair of the two directories. This can work perfectly well if all changes you have are only renamings and if all files are of different size. If you have several files of the same size (maybe by pure chance or because they are all of the same format which has a fixed size), you are in trouble again and will have to compare each possible pair of the same-size group.
Git-Diff
You mentioned git-diff in the tags. git does not store renames explicitly, but git diff can detect them by pairing a removed file with an added file of similar content. So if you intend to use git diff, you can rely to some degree on git's ability to detect renamings. This typically works if the removal and the addition under the new name show up in the same diff, for example in one single commit. If you look at commits one by one and the file gets added with a new name in one commit while the older version is removed in another commit, this won't work properly. There is a lot more to learn about renames in git diff; see man git-diff and search for rename. There are about a dozen places where this gets mentioned, so I won't try to summarize them here myself.
EDIT: You can use a command like git diff --find-renames --diff-filter=ACDMTUX (i.e. you let all kinds of changes pass the filter with the exception of renamings).
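For the git half of the question, here is a minimal sketch, assuming git 1.9 or later (for the :(exclude) pathspec magic). The function name count_added_java_lines is hypothetical. Instead of grepping for '^>' (git diff marks added lines with + rather than >), it sums the "added" column of --numstat.

```shell
# Hypothetical helper: count lines added to .java files between two commits,
# skipping renames and the top-level folders 'a' and 'b'.
count_added_java_lines() {
    # --diff-filter=ACDMTUX keeps every change type except R (renames).
    # '*.java' matches at any depth; ':(exclude)a' drops everything under a/.
    git diff --numstat --find-renames --diff-filter=ACDMTUX "$1" "$2" -- \
        '*.java' ':(exclude)a' ':(exclude)b' |
        awk '{ added += $1 } END { print added + 0 }'
}
```

Usage would be something like count_added_java_lines <old-commit> <new-commit>, run from the top of the repository. Note that a file which was renamed and also edited is reported as a rename as well, so its changed lines are skipped too.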

Related

How to find duplicate directories

Let's create a test directory tree:
#!/bin/bash
top="./testdir"
[[ -e "$top" ]] && { echo "$top already exists!" >&2; exit 1; }
mkfile() { printf "%s\n" "$(basename "$1")" > "$1"; }
mkdir -p "$top"/d1/d1{1,2}
mkdir -p "$top"/d2/d1some/d12copy
mkfile "$top/d1/d12/a"
mkfile "$top/d1/d12/b"
mkfile "$top/d2/d1some/d12copy/a"
mkfile "$top/d2/d1some/d12copy/b"
mkfile "$top/d2/x"
mkfile "$top/z"
The structure is: find testdir \( -type d -printf "%p/\n" , -type f -print \)
testdir/
testdir/d1/
testdir/d1/d11/
testdir/d1/d12/
testdir/d1/d12/a
testdir/d1/d12/b
testdir/d2/
testdir/d2/d1some/
testdir/d2/d1some/d12copy/
testdir/d2/d1some/d12copy/a
testdir/d2/d1some/d12copy/b
testdir/d2/x
testdir/z
I need to find the duplicate directories, but I need to consider only files (i.e. I should ignore (sub)directories without files). So, from the above test tree the wanted result is:
duplicate directories:
testdir/d1
testdir/d2/d1some
because both (sub)trees contain only two identical files, a and b (and several directories without files).
Of course, I could run md5deep -Zr ., or walk the whole tree using a perl script (using File::Find+Digest::MD5 or Path::Tiny or the like) and calculate the files' md5 digests, but this doesn't help with finding the duplicate directories... :(
Any idea how to do this? Honestly, I haven't any idea.
EDIT
I don't need working code. (I'm able to code myself)
I "just" need some ideas "how to approach" the solution of the problem. :)
Edit2
The rationale behind why I need this: I have approx. 2.5 TB of data copied from many external HDDs as a result of a wrong backup strategy. E.g. over the years, whole $HOME dirs were copied onto (many different) external HDDs. Many sub-directories have the same content, but they're in different paths. So now I'm trying to eliminate the same-content directories.
And I need to do this by directories, because there are directories which have some duplicate files, but not all. Let's say:
/some/path/project1/a
/some/path/project1/b
and
/some/path/project2/a
/some/path/project2/x
i.e. a is a duplicate file (not only by name, but by content too) - but it is needed for both projects. So I want to keep a in both directories - even if they're duplicate files. Therefore I'm looking for some "logic" for finding duplicate directories.
Some key points:
If I understand right (from your comment, where you said: "(Also, when me saying identical files I mean identical by their content, not by their name)"), you want to find duplicate directories, i.e. directories whose content is exactly the same as in some other directory, regardless of the file names.
For this you must calculate some checksum or digest for the files. Identical digest = identical file (with great probability). :) As you already said, md5deep -Zr -of /top/dir is a good starting point.
I added the -of because for such a job you don't want to calculate digests for the targets of symlinks or for other special files like FIFOs - just plain files.
Calculating the md5 for each file in a 2.5 TB tree will surely take a few hours of work, unless you have a very fast machine. md5deep runs a thread for each CPU core. So, while it runs, you can write some scripts.
Also, consider running md5deep with sudo, because it could be frustrating if, after a long run time, you get error messages about unreadable files only because you forgot to change the files' ownership... (Just a note) :) :)
For the "how to":
For comparing "directories" you need to calculate some "directory-digest", for easy comparison and finding duplicates.
The most important thing is to realize the following key points:
You can exclude directories that contain files with unique digests. If a file is unique, i.e. has no duplicates, it is pointless to check its directory. A unique file in some directory means that the directory is unique too. So, the script should ignore every directory containing files with unique MD5 digests (from the md5deep output).
You don't need to calculate the "directory-digest" from the files themselves (as you are trying to do in your followup question). It is enough to calculate the "directory-digest" from the already calculated md5 digests of the files; you just must ensure that you sort them first!
For example, if your directory /path/to/some contains only two files a and b and
if file "a" has md5 : 0cc175b9c0f1b6a831c399e269772661
and file "b" has md5: 92eb5ffee6ae2fec3ad71c777531578f
you can calculate the "directory-digest" from the above file-digests, e.g. using the Digest::MD5 you could do:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(sort qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
and will get 3bc22fb7aaebe9c8c5d7de312b876bb8 as your "directory-digest". The sort is crucial(!) here, because the same command, but without the sort:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
produces: 3a13f2408f269db87ef0110a90e168ae.
Note that the above digests aren't digests of the file contents themselves, but they will (with great probability) differ for directories with different files and will be the same for directories with identical files (because identical files have identical md5 file-digests). The sorting ensures that you always calculate the digest in the same order, e.g. if some other directory contains two files
file "aaa" has md5 : 92eb5ffee6ae2fec3ad71c777531578f
file "bbb" has md5 : 0cc175b9c0f1b6a831c399e269772661
using the above sort and md5 you will again get 3bc22fb7aaebe9c8c5d7de312b876bb8, i.e. the directory contains the same files as above...
So, in this way you can calculate a "directory-digest" for every directory you have, and you can be sure that if you get another directory-digest 3bc22fb7aaebe9c8c5d7de312b876bb8, it means this directory has exactly the above two files a and b (even if their names are different).
This method is fast, because you calculate the "directory-digests" only from short 32-character strings, so you avoid calculating the file digests multiple times.
The final part is easy now. Your final data should be in form:
3a13f2408f269db87ef0110a90e168ae /some/directory
16ea2389b5e62bc66b873e27072b0d20 /another/directory
3a13f2408f269db87ef0110a90e168ae /path/to/other/directory
so from this it is easy to see that /some/directory and /path/to/other/directory are identical, because they have identical "directory-digests".
Hm... All of the above is only a few lines of perl. It would probably be faster to write the perl script here directly than the long textual answer above - but you said you don't want code... :) :)
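A rough shell sketch of the grouping step, under some loud assumptions: the input is a listing in the usual "digest  path" format (as md5sum prints it), only files directly inside a directory count towards that directory, and paths contain no tabs or newlines. The function name duplicate_dirs is made up for illustration. It also skips the final md5 step and uses the concatenation of the sorted file digests directly as the directory key, which serves the same purpose for comparison.

```shell
# Reads "digest  path" lines on stdin and prints "key<TAB>directory" for
# every directory whose key (its sorted file digests, concatenated) occurs
# more than once.
duplicate_dirs() {
    awk '{
        digest = $1
        path = $0
        sub("^[^ ]+ +", "", path)                    # drop the digest column
        if (!sub("/[^/]*$", "", path)) path = "."    # keep only the directory
        print path "\t" digest
    }' |
    LC_ALL=C sort |
    awk -F'\t' '
        { key[$1] = key[$1] $2 }    # digests arrive sorted within each directory
        END {
            for (dir in key) count[key[dir]]++
            for (dir in key)
                if (count[key[dir]] > 1) print key[dir] "\t" dir
        }' |
    LC_ALL=C sort
}
```

With GNU coreutils this could be fed by something like find /top/dir -type f -exec md5sum {} + | duplicate_dirs. Directories sharing a key end up on adjacent lines.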
A traversal can identify directories which are duplicates in the sense you describe. I take it that this is: if all files in a directory are equal to all files of another then their paths are duplicates.
Find all files in each directory and form a string with their names. You can concatenate the names with a comma, say (or some other sequence that is certainly not in any names). This is to be compared. Prepend the path to this string, so to identify directories.
Comparison can be done for instance by populating a hash with keys being strings with filenames and path their values. Once you find that a key already exists you can check the content of files, and add the path to the list of duplicates.
The strings with path don't have to be actually formed, as you can build the hash and dupes list during the traversal. Having the full list first allows for other kinds of accounting, if desired.
This is altogether very little code to write.
An example. Let's say that you have
dir1/subdir1/{a,b} # duplicates (files 'a' and 'b' are considered equal)
dir2/subdir2/{a,b}
and
proj1/subproj1/{a,b,X} # NOT duplicates, since there are different files
proj2/subproj2/{a,b,Y}
The above prescription would give you strings
'dir1/subdir1/a,b',
'dir2/subdir2/a,b',
'proj1/subproj1/a,b,X',
'proj2/subproj2/a,b,Y';
where the (sub)string 'a,b' identifies dir1/subdir1 and dir2/subdir2 as duplicates.
I don't see how you can avoid a traversal to build a system that accounts for all files.
The procedure above is the first step, not handling directories with files and subdirectories.
Consider
dirA/ dirB/
a b sdA/ a X sdB/
c d c d
Here the paths dirA/sdA/ and dirB/sdB/ are duplicates by the problem description but the whole dirA/ and dirB/ are distinct. This isn't shown in the question but I'd expect it to be of interest.
The procedure from the first part can be modified for this. Iterate through directories, forming a path component at every step. Get all files in each, and all subdirectories (if none we are done). Append the comma-separated file list to the path component (/sdA/). So the representation of the above is
'dirA/sdA,a,b/c,d', 'dirB/sdB,a,X/c,d'
For each file-list substring (c,d) found to already exist we can check its path against the existing one, component by component. Now a hash with keys like c,d won't do since this example has the same file-list for distinct hierarchies, but a modified (or other) data structure is needed.
Finally, there may be more subdirectories parallel to sdA (say sdA2). We care only about its own path, except for the parallel files (a,b in that component of the path, dirA/sdA2,a,b/). So keep in mind all bottom-level file-lists (c,d) with their paths and, if file-lists are equal and paths are of the same length, check whether their paths have a,b file-lists equal in each path component.
I don't know whether this is a workable solution for you, but I'd expect "near-duplicates" to be rare -- the backup is either a duplicate or not. So there may not be much need to handle further edge cases in complex sprawling hierarchies. This procedure should be at least a useful pre-selection mechanism that would greatly reduce the need for further work.
This assumes that equal file-names very likely indicate equal files. A part of that is my expectation that if a file was even just renamed it still cannot be considered a duplicate. If this is not so this approach won't work and one would need something along the lines of the answer by jm666.
I made a tool which searches for duplicate folders.
https://github.com/un1t/dirdups
dirdups testdir -i 1
The -i 1 option considers folders duplicates if they have at least 1 file in common. Without this option the default value is 10.
In your case it will find the following directories:
testdir/d1/d12/
testdir/d2/d1some/d12copy/

Bash get all specific files in specific directory

I have a script that takes as an argument a path to a file upon which it performs certain operations. These files are stored in directories with path storage///_id/files (so in 2016 July 22 it would be storage/2016/Jul/22_1/files for the first set of files, .../Jul/22_2/files for second one etc.). The problem is each directory stores files with two extensions (say file.doc, file.txt) and I want to perform operations only on .txt files. I've tested earlier something like
for file in "/home/gonczor/temp/"*/*".txt"; do
echo "$file"
done
And it worked perfectly, given that the names of the directories don't change. When I move one step further and add these 22_1, 22_2, 23_1 directories, something strange happens.
This is my script (simplified):
for file in "$FILE_PATH/""$YEAR/""$MONTH/""$DAY"*/*".txt"; do
my_program ${report}
done
And instead of finding .../2016/Jul/22_1/file.txt it finds /2016/Jul/22*/*.txt
How can I make it work? The solution I've tried to make up is from here
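One possible cause, offered as a guess: when a glob pattern matches nothing, bash leaves it unexpanded, and given the layout described above (.../Jul/22_1/files/...), the pattern appears to lack the files component. A minimal sketch under that assumption; the function name list_txt_files and its parameters are hypothetical:

```shell
# Hypothetical helper: print every .txt file for one day.
# $1 = storage root, $2 = year, $3 = month, $4 = day
list_txt_files() {
    for file in "$1/$2/$3/$4"_*/files/*.txt; do
        # An unmatched pattern is left as a literal string; skip it.
        [ -e "$file" ] || continue
        printf '%s\n' "$file"
    done
}
```

For example, list_txt_files storage 2016 Jul 22 would print the .txt files under storage/2016/Jul/22_1/files, storage/2016/Jul/22_2/files and so on, and nothing at all when no such files exist.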

diff of identical tar archive return they are not identical

I have a script that generates a tar archive using the command
tar -zacf /tmp/foo.tar.gz /home/yotam/foo
It then checks whether a tar file is in a certain folder and whether there are any changes between the two archives; if so, it keeps the new one:
if ! [ -e /home/yotam/barr/foo.tar.gz ]; then
cp /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz
cond=1
else
#compare
diff --brief <(sort /tmp/foo.tar.gz) <(sort /home/yotam/bar/foo.tar.gz) >/dev/null
cond=$?
fi
if [ $cond -eq 1 ]; then
rm /home/yotam/bar/foo.tar.gz
cp /tmp/foo.tar.gz /home/yotam/foo.tar.gz
fi
However, this script always views the two archive files as different, even if I'm not doing anything to either of the two archives or to the foo folder itself. What is wrong with my check?
Edit:
For what it's worth, replacing the diff line with
diff --brief /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz >/dev/null
yields the same result.
I'm not sure that a gzip archive can be used as a hash function. Perhaps the gzip implementation relies on the current date and time and therefore produces different output on each execution.
I'd recommend using some widely used hash function. Take a look at the hash git uses internally (SHA-1, e.g. via shasum), for example.
More at: How does git compute file hashes?
It looks like you're doing a line-wise compare of zipped tar archives, after sorting the lines. There are multiple reasons why this is a bad idea (for one: sorting by line doesn't make sense for something that is gzipped). To check whether 2 files are identical, either use diff file1 file2, or calculate a hash for each file (with md5/md5sum filename) and compare those.
The problem is that gzip stores metadata in the archive header, such as the name of the file it compresses and a timestamp. If you gzip 2 identical files, you can therefore get 2 different archives.
So what can you do to solve this? For one, you can compare gunzipped versions of both files: diff <(gzcat out/out2.tar.gz) <(gzcat out2.tar.gz). I assume you have the sort in there in case the files get tarred in a different order, but I don't think you have to worry about that. If that is a problem for you, check out something like tarsum. This will give you a better result, since if you use sort, you will not notice moving a line from one file to the other, or switching two lines in a file.
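The "compare the gunzipped versions" idea can be sketched without process substitution as follows. The function name same_payload is made up for illustration, and the sketch assumes both files exist and are valid gzip files. cksum is only a CRC, so swap in a stronger hash if accidental collisions are a concern.

```shell
# Succeeds (exit status 0) when two gzip files decompress to the same data,
# regardless of differences in the gzip headers.
same_payload() {
    [ "$(gzip -dc "$1" | cksum)" = "$(gzip -dc "$2" | cksum)" ]
}
```

In the script from the question this could replace the diff line, e.g. same_payload /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz, followed by cond=$?.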

mac: compare two folders, extract non-identical photos only

I have been looking in S.O. for ways to compare two folders to extract only the "new" or the "non-identical" photos. I have two large folders and I only want the new files, I need to identify them. Is there a way to do it? or an application to do it?
Situation: I gave iOS8 a try and then reverted back to iOS711. But my most recent backup (the one I did before upgrading to iOS8) was corrupt, due to the downgrade I did.
Now, I have a copy of my iOS photos from the recent backup (I had to use backup extractor to extract the photos from), but I also have the photos from a month old backup that I restored into my phone after I gave up on the corrupt and recent backup.
I have now two sets of photo libraries. one with the up to date photos (which cannot be restored in the iPhone through iTunes), and one with the month old photo library (which was restored to my iPhone through iTunes easily).
I extracted photos from both backups, and I ended up with two directories. I only need the new photos (the difference between the two folders).
I hope it's now clearer, and more detailed.
Thanks a lot!
Can you install duff via homebrew or macports? If so, the following should give you a list of files that occur only once:
$ duff -r -f '' folder1 folder2 | sort > duplicate_files.txt
$ find folder1 folder2 -print | sort > all_files.txt
$ diff all_files.txt duplicate_files.txt | grep '^< ' | cut -c 3-
If you don't want to install additional packages, this would also work:
sort <(ls dir1) <(ls dir2) | uniq -u
That'll sort the list of files in both directories and then return only the items that appear only once. If you want to also return the locations of those files, you could then search for them.
This compares files by name, which might not be desirable. If you want to compare them by something else (e.g. size), then the answer gets a little more complicated.
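For comparing by content rather than by name, one possible sketch uses cksum (CRC plus size) as the fingerprint. The function name new_files is hypothetical, file names are assumed to contain no newlines, and since a CRC can collide a stronger hash may be preferable for a large photo library.

```shell
# Lists files under $2 whose content (CRC and size, as reported by cksum)
# matches no file under $1.
new_files() {
    {
        find "$1" -type f -exec cksum {} + | sed 's/^/old /'
        find "$2" -type f -exec cksum {} + | sed 's/^/new /'
    } | awk '
        $1 == "old" { seen[$2 ":" $3] = 1; next }
        !(($2 ":" $3) in seen) {
            sub("^new [^ ]+ [^ ]+ ", "")    # strip the tag, CRC and size
            print
        }'
}
```

Here new_files folder1 folder2 would print the photos that exist only in folder2, even if a photo also present in folder1 was renamed.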

Excluding folders in Winrar

A Day with Winrar
All I wanted to do was exclude folders and their contents using wildcards, and even after reading the docs, it turned into a guessing game...
So my test bed looks like:
C:\!tmp1\f1
C:\!tmp1\f1\f1.txt
C:\!tmp1\f1\a
C:\!tmp1\f1\a\a.txt
C:\!tmp1\f2
C:\!tmp1\f2\f2.txt
C:\!tmp1\f2\a
C:\!tmp1\f2\a\a.txt
And I am executing:
C:\>"c:\program files\winrar\winrar.exe" a -r !tmp1.rar !tmp1
which gives me a rar with !tmp1 as the root (sole top level folder).
The exclude switch is -x<filepathpattern> and may be included multiple times.
So, given that we want to exclude f2, and all its subcontents...
-x*\f2\*
removes the contents, but leaves f2
-xf2
does nothing - includes all
-x\f2
does nothing - includes all
-x*\f2
does nothing - includes all (now I'm mad), so surely it must be..
-x\f2\
nope, does nothing - includes all. So it has GOT to be...
-x*\f2\
hell no, does nothing - includes all. and I already know that
-x*\f2\*
removes the contents, but leaves f2. Onward we go...
-x*f2\
does nothing - includes all. Grrrr. Aha! how about...
-x!tmp1\f2\
nope, does nothing - includes all. WTF. Alright, So it has GOT to be...
-x!tmp1\f2
Holy moly, it worked! Hmmm, then how come....
-x*\f2
does not work? This was the little demon that sent me down this crazed path to begin with and should have worked!
Given all that, do I dare try to go after */a/* directories, removing contents and the dirs?
-x*\a
does not work, of course, does nothing.
-x*\*\a
does not work, of course, does nothing.
-x!tmp1\*\a
nope. But...
-x*\a\*
removes contents of both dirs, but leaves the folders. So, in desperation I can use the -ed switch which will not store empty folders, but this is a broad hack, I want to eliminate the folders specified not all empty folders.
With my animosity growing toward winrar, I am passing the baton of information forward with an eye to that glorious day when we will know how to specifically exclude a folder and its contents using wildcards and not using the -ed switch.
(Quite old question but still may be relevant)
Maybe what you simply needed was this:
-x*\f2 -x*\f2\*
two exclude switches, should remove directory f2 and all its contents.
An even older question by now, but came across this question so I reproduced your folder structure and, at least nowadays (Winrar 5.11, not the latest but quite new), this works:
-x*\f2
So the whole command line is:
"C:\Program Files\WinRAR\Rar.exe" a -m5 -s !tmp1.rar !tmp1 -x*\f2
And this is what is stored in the .rar file:
!tmp1\f1\a\a.txt
!tmp1\f1\f1.txt
!tmp1\f1\a
!tmp1\f1
!tmp1
Similarly, if you use -x*\a, all a folders are excluded, storing this:
!tmp1\f1\f1.txt
!tmp1\f2\f2.txt
!tmp1\f1
!tmp1\f2
!tmp1
Finally, combining both parameters (-x*\f2 -x*\a), you get this:
!tmp1\f1\f1.txt
!tmp1\f1
!tmp1
To manage a large list of files to be excluded, you can create a text file and list all excluded files/folders relative to the source folder:
1) Create the file list.txt and write the names of the excluded files/folders
Note: * refers to the source; all files/folders are relative to the source folder
*\f2
*\f3
2) Run the command
rar a -r -x@list.txt target.rar source-folder
