Finding duplicate files in Unix by content - shell

How to find the list of duplicate files recursively by content instead of by file name.

find . -type f -exec basename {} \; | sed 's/\(.*\)\..*/\1/' | sort | uniq -c | grep -v "^[ \t]*1 "
This lists base names (with the extension stripped) that occur more than once anywhere under the current directory. Note that it matches by name, not by content; the checksum-based approaches below compare content.
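Since the question asks for duplicates by content, a minimal checksum-based sketch looks like this (assuming GNU coreutils, i.e. md5sum and a uniq that supports -w):
# Hash every file, sort so equal hashes become adjacent, then print
# only the groups whose first 32 characters (the md5 hash) repeat.
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate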

Related

Shell script file copy checker

I copied and re-sorted nearly 1TB of files on a Drobo using a find . -name \*.%ext% -print0 | xargs -I{} -0 cp -v {} %File%s command. I need to make sure all the files copied correctly. This is what I have so far:
#!/bin/sh
find . -type f -exec basename {} \; > Filelist.txt
sort -o Filelist.txt Filelist.txt
uniq -c Filelist.txt Duplist.txt
I need to find a way to get the checksum for each file, as well as to make sure that every file has been duplicated. The source folder is in the same directory as the copies; it is arranged as follows:
_Stock
_Audio
_CG
_Images
_Source (the files in all other folders come from here)
_Videos
I'm working on OS X.
#!/bin/sh
find . \( ! -regex '.*/\..*' \) -type f -exec shasum {} \; -exec basename {} \; | cut -c -40 | sed 'N;s/\n/ /' > Filelist.txt
sort -o Filelist.txt Filelist.txt
uniq -c Filelist.txt Duplist.txt
sort -o Duplist.txt Duplist.txt
The -regex expression excludes hidden files. The shasum and basename actions emit two separate lines per file, so we pipe to cut (to trim each line to the 40-character hash or name) and then to sed (to merge each pair of lines into one) so that sort and uniq can parse them. The script is messy, but it got the job done quite nicely.
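For comparison, the same idea can be sketched as a single pass that builds each "hash name" line directly (illustrative only; it assumes file names without embedded newlines):
#!/bin/sh
# Print "<sha1> <basename>" for every non-hidden file, then count
# repeats; a file that was copied correctly shows a count above 1.
find . \( ! -regex '.*/\..*' \) -type f | while IFS= read -r f; do
    printf '%s %s\n' "$(shasum "$f" | cut -c -40)" "$(basename "$f")"
done | sort | uniq -c | sort -n > Duplist.txt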

Find duplicates of a specific file on macOS

I have a directory that contains files and other directories. And I have one specific file that I know has duplicates somewhere in the given directory tree.
How can I find these duplicates using Bash on macOS?
Basically, I'm looking for something like this (pseudo-code):
$ find-duplicates --of foo.txt --in ~/some/dir --recursive
I have seen that there are tools such as fdupes, but I'm neither interested in any duplicate files (only duplicates of a specific file) nor am I interested in duplicates anywhere on disk (only within the given directory or its subdirectories).
How do I do this?
For a solution that uses only the built-in macOS shell utilities (rather than a third-party tool such as fdupes), try this:
find DIR -type f -print0 | xargs -0 md5 -r | grep "$(md5 -q FILE)"
where:
DIR is the directory you are interested in;
FILE is the file (path) you are searching for duplicates of.
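For example, instantiated with the names from the question (foo.txt and ~/some/dir):
find ~/some/dir -type f -print0 | xargs -0 md5 -r | grep "$(md5 -q foo.txt)"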
If you only need the duplicate files' paths, then pipe through this as well:
cut -d' ' -f2-
(the trailing - keeps fields 2 onward, so paths containing spaces stay intact)
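Putting the pieces together, the pseudo-code from the question could be sketched as a small shell function (find_duplicates_of is a made-up name, not a real tool):
# Illustrative sketch: print the paths of duplicates of $1 under $2.
find_duplicates_of() {
    target=$(md5 -q "$1") || return    # hash of the reference file
    find "$2" -type f -print0 |
        xargs -0 md5 -r |              # one "hash path" line per file
        grep "^$target " |             # keep lines whose hash matches
        cut -d' ' -f2-                 # print only the paths
}
# Usage: find_duplicates_of foo.txt ~/some/dir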
If you're looking for a specific filename, you could do:
find ~/some/dir -name foo.txt
which would return a list of all files with the name foo.txt in the directory. If you want to check whether there are multiple files in the directory with the same name, you could do:
find ~/some/dir -type f -exec basename {} \; | sort | uniq -d
This will give you a list of files with duplicate names (you can then use find again to figure out where those live).
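That follow-up step could look roughly like this (a sketch; it assumes names without embedded newlines):
# For each duplicated base name, list every path where it occurs.
find ~/some/dir -type f -exec basename {} \; | sort | uniq -d |
while IFS= read -r name; do
    find ~/some/dir -type f -name "$name"
done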
--- EDIT ---
If you're looking for identical files (with the same md5 sum), you could also do the following (note that md5sum and uniq's --check-chars option are GNU tools, not part of stock macOS):
find . -type f -exec md5sum {} \; | sort | uniq -d --check-chars=32
--- EDIT 2 ---
If your md5sum doesn't output the filename, you can use:
find . -type f -exec echo -n "{} " \; -exec md5sum {} \; | awk '{print $2, $1}' | sort | uniq -d --check-chars=32
--- EDIT 3 ---
If you're looking for files with a specific md5 sum:
sum=`md5sum foo.txt | cut -f1 -d " "`
find ~/some/dir -type f -exec md5sum {} \; | grep "$sum"

Show and count all file extensions in directory (with subdirectories)

I'm using the command from this topic to list all file extensions in a directory and all of its subdirectories.
find . -type f -name '*.*' | sed 's|.*\.||' | sort -u
How can I count the number of appearances of each extension?
Like:
png: 140
Like this, using uniq with the -c, --count flag:
find . -type f -name '*.*' | sed 's|.*\.||' | sort | uniq -c
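If you want the exact png: 140 style from the question, one option is to swap the columns with an extra awk step, for example:
# Reformat the uniq -c output ("  140 png") into "png: 140".
find . -type f -name '*.*' | sed 's|.*\.||' | sort | uniq -c | awk '{print $2": "$1}'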

Find Files having multiple links in shell script

I want to find the files which have multiple links.
I am using Ubuntu 10.10.
find -type l
It shows all symlinks, but I want to count the links to a particular file.
Thanks.
With this command, you will get a summary of linked files:
find . -type l -exec readlink -f {} \; | sort | uniq -c | sort -n
or
find . -type l -print0 | xargs -n1 -0 readlink -f | sort | uniq -c | sort -n
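If by multiple links you mean hard links (a link count greater than one) rather than symlinks, find can filter on the link count directly; a possible sketch with GNU find (as shipped on Ubuntu):
# Print inode number and path for regular files with more than one
# hard link; sorting by inode groups the paths that share a file.
find . -type f -links +1 -printf '%i %p\n' | sort -n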

grep only text files

find . -type f | xargs file | grep text | cut -d':' -f1 | xargs grep -l "TEXTSEARCH"
Is this a good solution for finding TEXTSEARCH recursively in text files only?
You can use the -r (recursive) and -I (ignore binary) options in grep:
$ grep -rI "TEXTSEARCH" .
-I Process a binary file as if it did not contain matching data; this is equivalent to the --binary-files=without-match option.
-r Read all files under each directory, recursively; this is equivalent to the -d recurse option.
Another solution, less elegant than kev's, is to chain -exec commands in find together, without xargs and cut:
find . -type f -exec bash -c 'file -bi "$1" | grep -q text' _ {} \; -exec grep TEXTSEARCH {} \;
(Passing the path to bash -c as a positional parameter, rather than substituting {} into the script text, keeps file names with special characters from breaking the command.)
If you know the file extension you want to search, a very simple way to search all *.txt files from the current dir, recursively through all subdirs, case-insensitively:
grep -ri --include='*.txt' "sometext" *

Resources