Find duplicates of a specific file on macOS - bash

I have a directory that contains files and other directories, and I have one specific file that I know has duplicates somewhere in that directory tree.
How can I find these duplicates using Bash on macOS?
Basically, I'm looking for something like this (pseudo-code):
$ find-duplicates --of foo.txt --in ~/some/dir --recursive
I have seen that there are tools such as fdupes, but I'm not interested in arbitrary duplicate files (only duplicates of one specific file), nor in duplicates anywhere on disk (only within the given directory or its subdirectories).
How do I do this?

For a solution compatible with macOS built-in shell utilities, try this instead:
find DIR -type f -print0 | xargs -0 md5 -r | grep "$(md5 -q FILE)"
where:
DIR is the directory you are interested in;
FILE is the file (path) you are searching for duplicates of.
If you only need the paths of the duplicate files, pipe the output through this as well (-f2- keeps paths that contain spaces intact):
cut -d' ' -f2-
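If md5 is not available, or you would rather not rely on checksums, a byte-for-byte comparison with cmp can do the same job. This is only a sketch: find_duplicates_of is a made-up function name, and it swaps the md5 approach above for a size filter plus cmp -s, using only POSIX options of find.

```shell
# Hypothetical helper (the name is an assumption, not an existing tool):
# print every regular file under DIR whose content is identical to FILE.
find_duplicates_of() {
    file=$1
    dir=$2
    # Size in bytes; the arithmetic expansion strips the padding that some
    # wc implementations put in front of the number.
    size=$(( $(wc -c < "$file") ))
    # Only files of exactly the same size can be duplicates, so cmp runs on
    # few candidates. -print fires only when cmp -s reports "identical".
    find "$dir" -type f -size "${size}c" -exec cmp -s "$file" {} \; -print
}
```

Usage would be find_duplicates_of foo.txt ~/some/dir. If foo.txt itself lives under that directory, it is listed as well.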

If you're looking for a specific filename, you could do:
find ~/some/dir -name foo.txt
which returns a list of all files named foo.txt under that directory. To check whether several files in the tree share the same name, you could do:
find ~/some/dir -exec basename {} \; | sort | uniq -d
This will give you a list of files with duplicate names (you can then use find again to figure out where those live).
---- EDIT -----
If you're looking for identical files (with the same md5 sum), you could also do:
find . -type f -exec md5sum {} \; | sort | uniq -d --check-chars=32
--- EDIT 2 ----
If your md5sum doesn't output the filename, you can use:
find . -type f -exec echo -n "{} " \; -exec md5sum {} \; | awk '{print $2, $1}' | sort | uniq -d --check-chars=32
--- EDIT 3 ----
If you're looking for files with a specific md5 sum:
sum=$(md5sum foo.txt | cut -f1 -d " ")
find ~/some/dir -type f -exec md5sum {} \; | grep "$sum"

Related

Finding duplicate files in Unix by content

How to find the list of duplicate files recursively by content instead of file name
find . -type f -exec basename {} \; | sed 's/\(.*\)\..*/\1/' | sort | uniq -c | grep -v "^[ \t]*1 "
This lists the file names (with the extension stripped) that occur more than once across the folders. Note that it compares names, not content.

remove files from subfolders without the last three

I have a structure like that:
/usr/local/a/1.txt
/usr/local/a/2.txt
/usr/local/a/3.txt
/usr/local/b/4.txt
/usr/local/b/3.txt
/usr/local/c/1.txt
/usr/local/c/7.txt
/usr/local/c/6.txt
/usr/local/c/12.txt
...
I want to delete all *.txt files in the subfolders except the three most recently modified ones, but this only works in the current directory:
ls -tr *.txt | head -n-3 |xargs rm -f
I need to combine that with the code:
find /usr/local/**/* -type f
Should I use the maxdepth option?
Thanks for helping,
aola
This adds -maxdepth 1 to find so that each folder is handled one level at a time, sorts the files by modification time (newest first), uses tail to skip the three most recently modified files, and passes -r to xargs so that rm runs only if any files are left.
for folder in $(find /usr/local/ -type d)
do
find "$folder" -maxdepth 1 -type f -name "*.txt" | xargs -r ls -1t | tail -n +4 | xargs -r rm -f
done
Run the above command once without rm to ensure that the previous commands pick the proper files for deletion.
You've almost got the solution: use find to get the files, ls to sort them by modification date, and tail to omit the three most recently modified ones:
find /usr/lib -type f | xargs ls -t | tail -n +4 | xargs rm
If you would like to remove only the files at a specified depth add -mindepth 4 -maxdepth 4 to find parameters.
You can use find's -printf option to print the modification time in front of the file name, then sort on it (newest first) and strip it off again. This avoids using ls at all.
find /usr/local -type f -name '*.txt' -printf '%T@|%p\n' | sort -rn | cut -d '|' -f 2- | tail -n +4 | xargs rm -f
The other answers that use xargs ls -t can give incorrect results when there are more files than xargs can fit into a single ls -t invocation.
but for each subfolder, so when I have
/usr/local/a/1.txt
/usr/local/a/2.txt
/usr/local/a/3.txt
/usr/local/a/4.txt
/usr/local/b/4.txt
/usr/local/b/3.txt
/usr/local/c/1.txt
/usr/local/c/7.txt
/usr/local/c/6.txt
/usr/local/c/12.txt
I want to run this for each subfolder separately:
head -n-3 |xargs rm -f
So, assuming the files are sorted by date, the files to delete would be:
/usr/local/a/4.txt
/usr/local/c/12.txt
In every subfolder I want to keep the three newest files.
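For the per-subfolder case described above, one option is to loop over the directories and apply ls -t with tail inside each one. Below is a minimal sketch, assuming file names contain no newlines. prune_old_txt is a hypothetical name, and the second argument (how many files to keep) is an addition for illustration.

```shell
# Hypothetical helper: in every directory under ROOT, delete all *.txt files
# except the KEEP most recently modified ones (default 3).
# Assumes file names contain no newlines.
prune_old_txt() {
    root=$1
    keep=${2:-3}
    find "$root" -type d | while IFS= read -r dir; do
        (
            cd "$dir" || exit 0
            set -- *.txt
            [ -e "$1" ] || exit 0   # no *.txt in this directory
            # ls -t sorts newest first; tail skips the first KEEP names.
            ls -td -- "$@" | tail -n +"$((keep + 1))" |
                while IFS= read -r f; do
                    if [ -f "$f" ]; then rm -f -- "$f"; fi
                done
        )
    done
}
```

Calling prune_old_txt /usr/local would treat a, b and c independently. It is worth replacing rm -f with echo rm -f for a first dry run.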

bash shell script not working as intended using cmp with output redirection

I am trying to write a bash script that removes duplicate files from a folder, keeping only one copy.
The script is the following:
#!/bin/sh
for f1 in `find ./ -name "*.txt"`
do
if test -f $f1
then
for f2 in `find ./ -name "*.txt"`
do
if [ -f $f2 ] && [ "$f1" != "$f2" ]
then
# if cmp $f1 $f2 &> /dev/null # DOES NOT WORK
if cmp $f1 $f2
then
rm $f2
echo "$f2 purged"
fi
fi
done
fi
done
I want to redirect stdout and stderr to /dev/null to avoid printing them to the screen. But with the commented-out statement the script does not work as intended and removes all files but the first.
I'll give more information if needed.
Thanks
A few comments:
First, the:
for f1 in `find ./ -name "*.txt"`
do
if test -f $f1
then
is the same as (find only plain files with the txt extension)
for f1 in `find ./ -type f -name "*.txt"`
Better syntax (it is POSIX, so it also works in sh) is
for f1 in $(find ./ -type f -name "*.txt")
and finally the whole thing is wrong, because if a filename contains a space, the f1 variable will not get the full path name. So instead of the for loop, do:
find ./ -type f -name "*.txt" -print | while read -r f1
and as @Sir Athos pointed out, a filename can even contain \n, so the best is to use
find . -type f -name "*.txt" -print0 | while IFS= read -r -d '' f1
Second:
Use "$f1" instead of $f1 - again, because the $f1 can contain space.
Third:
doing N*N comparisons is not very efficient. You should compute a checksum (md5 or better sha256) for every txt file. When the checksums are identical, the files are dups.
If you don't trust checksums, simply compare only the files that have identical checksums. Files with different checksums are SURELY not duplicates. ;)
Computing checksums is slow too, so you should first compare only files with the same size. Files of different sizes are not duplicates...
You can skip empty txt files - they are all duplicates :).
so the final command can be:
find -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find -type f -name \*.txt -size %c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
commented:
#find all non-empty files with the txt extension and print their sizes (in bytes)
find . -not -empty -type f -name \*.txt -printf "%s\n" |\
#sort the sizes numerically, and keep only duplicated sizes
sort -rn | uniq -d |\
#for each duplicated size, find all files with that size and print their names (paths)
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 |\
#make an md5 checksum for them
xargs -0 md5sum |\
#sort the checksums and keep duplicated files separated with an empty line
sort | uniq -w32 --all-repeated=separate
You can now simply edit the output and decide which files you want to remove and which you want to keep.
&> is bash syntax; you'll need to change the shebang line (first line) to #!/bin/bash (or the appropriate path to bash). In a plain POSIX shell, cmp $f1 $f2 &> /dev/null is parsed as cmp $f1 $f2 & followed by > /dev/null, so cmp runs in the background and the if always sees success, which is why every file gets removed.
Or if you're really using the Bourne Shell (/bin/sh), then you have to use old-style redirection, i.e.
cmp ... >/dev/null 2>&1
Also, &> works in bash 3.x as well; it is the appending form &>> that was only introduced in bash 4.
IHTH
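The portable redirection can be dropped straight into the if from the question. Here is a small sketch; same_content is a made-up name, and the quoting of the arguments is added here so that paths with spaces survive.

```shell
# Succeeds (exit status 0) if the two files are byte-for-byte identical.
# >/dev/null 2>&1 silences stdout and stderr in any POSIX shell, where
# &> would instead be parsed as "run in background, then redirect".
same_content() {
    cmp -- "$1" "$2" >/dev/null 2>&1
}
```

In the script this would read: if same_content "$f1" "$f2"; then rm -- "$f2"; fi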
Credit to @kobame for this answer: this is really a comment, posted as an answer only for the formatting.
You don't need to call find twice; print out both the size and the filename in the find command
find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
# find the files that have duplicate sizes
sort -n | uniq -Dw 8 |
# strip off the size and get the md5 sum
cut -c 10- | xargs md5sum
An example
$ cat a.txt
this is file a
$ cat b.txt
this is file b
$ cat c.txt
different contents
$ cp a.txt d.txt
$ cp b.txt e.txt
$ find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
sort -n | uniq -Dw 8 | cut -c 10- | xargs md5sum
76fd4c1589ef708d9203f3cf09cfd032 ./a.txt
e2d75fd6a1080efb6230d0608b1f9014 ./b.txt
76fd4c1589ef708d9203f3cf09cfd032 ./d.txt
e2d75fd6a1080efb6230d0608b1f9014 ./e.txt
To keep one and delete the rest, I would pipe the output into:
... | awk '++seen[$1] > 1 {print $2}' | xargs echo rm
rm ./d.txt ./e.txt
Remove the echo if your testing is satisfactory.
Like many complex pipelines, filenames containing newlines will break it.
All nice answers, so only one short suggestion: you can install and use the
fdupes -r .
from the man page:
Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison.
Added by @Francesco
fdupes -rf . | xargs rm -f
to remove the dupes (the -f option of fdupes omits the first occurrence of each file, so only the dupes are listed).
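The pipelines above rely on GNU options (-printf, uniq -w, md5sum). For systems without them, here is a rough sketch of the same idea with POSIX tools only: cksum yields a cheap checksum plus the size in one go, and cmp confirms each candidate. list_duplicate_txt is a hypothetical name, cksum is swapped in for md5sum, and the sketch assumes file names contain no newlines.

```shell
# Hypothetical helper: print the non-empty *.txt files under DIR that are
# byte-for-byte copies of another file listed earlier in the same group.
list_duplicate_txt() {
    dir=$1
    # cksum prints "<crc> <size> <path>" per file; sorting in the C locale
    # puts files with the same crc and size on adjacent lines.
    find "$dir" -type f -name '*.txt' ! -size 0 -exec cksum {} + |
        LC_ALL=C sort |
        {
            prev_key=
            first=
            while read -r crc size path; do
                key="$crc $size"
                # Same crc and size as the group's first file: confirm with
                # cmp, since a CRC match alone does not prove equality.
                if [ "$key" = "$prev_key" ] && cmp -s "$first" "$path"; then
                    printf '%s\n' "$path"
                else
                    prev_key=$key
                    first=$path
                fi
            done
        }
}
```

The first file of each group is never printed, so the output is the list of redundant copies. Review it before feeding it to rm.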

grep only text files

find . -type f | xargs file | grep text | cut -d':' -f1 | xargs grep -l "TEXTSEARCH"
Is this a good solution for finding TEXTSEARCH recursively in text files only?
You can use the -r (recursive) and -I (ignore binary) options in grep:
$ grep -rI "TEXTSEARCH" .
-I Process a binary file as if it did not contain matching data; this is equivalent to the --binary-files=without-match option.
-r Read all files under each directory, recursively; this is equivalent to the -d recurse option.
Another solution, less elegant than kev's, is to chain -exec commands in find together, without xargs and cut:
find . -type f -exec sh -c 'file -bi "$1" | grep -q text' _ {} \; -exec grep TEXTSEARCH {} \;
If you know the file extension you want to search, a very simple way to search all *.txt files from the current dir, recursively through all subdirs, case insensitive, is:
grep -ri --include='*.txt' "sometext" .

How to echo directories containing matching file with Bash?

I want to write a bash script which will use a list of all the directories containing specific files. I can use find to echo the path of each and every matching file. I only want to list the path to the directory containing at least one matching file.
For example, given the following directory structure:
dir1/
matches1
matches2
dir2/
no-match
The command (looking for 'matches*') will only output the path to dir1.
As extra background, I'm using this to find each directory which contains a Java .class file.
find . -name '*.class' -printf '%h\n' | sort -u
From man find:
-printf format
%h Leading directories of file’s name (all but the last element). If the file name contains no slashes (since it is in the current directory) the %h specifier expands to ".".
On OS X and FreeBSD, with a find that lacks the -printf option, this will work:
find . -name '*.class' -print0 | xargs -0 -n1 dirname | sort --unique
The -n1 in xargs sets to 1 the maximum number of arguments taken from standard input for each invocation of dirname
GNU find
find /root_path -type f -iname "*.class" -printf "%h\n" | sort -u
OK, I'm way too late, but you could also do it without find, to answer specifically the "matching file with Bash" part (or at least with a POSIX shell).
ls */*.class | while read -r path; do
echo "${path%/*}"
done | sort -u
The ${VARNAME%/*} will strip everything after the last / (if you wanted to strip everything after the first, it would have been ${VARNAME%%/*}).
Regards.
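To see the difference between % and %% concretely, here is a tiny illustration (the function names are made up for the example):

```shell
# % removes the shortest suffix matching /*, i.e. the last path component.
parent_dir() { printf '%s\n' "${1%/*}"; }

# %% removes the longest suffix matching /*, leaving only the first component.
top_dir() { printf '%s\n' "${1%%/*}"; }
```

With these, parent_dir dir1/sub/Foo.class prints dir1/sub, while top_dir dir1/sub/Foo.class prints dir1.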
find / -name '*.class' -printf '%h\n' | sort --unique
Far too late, but this might be helpful to future readers:
I personally find it more helpful to have the list of folders printed into a file, rather than to Terminal (on a Mac).
For that, you can simply output the paths to a file, e.g. folders.txt, by using:
find . -name '*.sql' -print0 | xargs -0 -n1 dirname | sort --unique > folders.txt
How about this?
find dirs/ -name '*.class' -exec dirname '{}' \; | awk '!seen[$0]++'
For the awk command, see #43 on this list

Resources