Find duplicates in variable - bash

I am trying to find duplicates in a list. Right now I am searching for files with specific file extensions and storing them in a variable called 'files'.
For each file in files I am formatting it so I only have the filename.
I then want to check this list for duplicates but I can't get my head around it.
files=$(find /root/123 -type f \( -iname "*.txt" -o -iname "*.bat" \))
for file in $files; do
formatted=$(echo ${file##*/})
unique=$(echo $formatted | sort | uniq -c)
done
echo $unique
Any help is much appreciated!!

I guess you don't need to reinvent the wheel; simply use fdupes or fslint.
Depending on your system, you can install it by using:
yum -y install fdupes
or
apt-get install fdupes
Usage of fdupes is pretty straight forward:
fdupes /path/to/dir
If you just need the .txt files, you can filter the result with grep, i.e.:
fdupes /path/to/dir | grep '\.txt'
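If the duplicates may also live in subdirectories, fdupes has a recursive flag; the same grep filter applies:
fdupes -r /path/to/dir | grep '\.txt'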

$files is not an array. It is a string.
You are splitting it on whitespace. This is not safe for filenames with spaces.
You are also globbing. This isn't safe for filenames with globbing metacharacters in the names.
See Bash FAQ 001 for how to safely operate over data line-by-line. Also see Don't read lines with for.
You can also get find to spit out arbitrarily formatted output with the -printf argument. (i.e. -printf '%f\n' will print out just the file name, one per line, with no path information.)
You don't need echo for that variable assignment. (i.e. formatted=${file##*/} works just fine.)
$formatted contains a single filename. You can't really sort or uniq a single item.
Putting all the above together, and assuming that you want to detect duplicates by basename (and not by file contents), then...
If you aren't worried about filenames with newlines then you can just use this:
find /root/123 -type f \( -iname "*.txt" -o -iname "*.bat" \) -printf '%f\n' | sort | uniq -c
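If you only want the names that actually occur more than once, uniq -d limits the output to duplicated lines (still with counts via -c):
find /root/123 -type f \( -iname "*.txt" -o -iname "*.bat" \) -printf '%f\n' | sort | uniq -cd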
If you are worried about them then you need to read the lines manually (something like this for bash 4+):
declare -A files
while IFS= read -r -d '' file; do
((files["$file"]+=1))
done < <(find /root/123 -type f \( -iname "*.txt" -o -iname "*.bat" \) -printf '%f\0')
declare -p files
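If you then want just the duplicated names rather than the whole dump, a small loop over the array (a sketch, assuming the files array built above) does it:
for name in "${!files[@]}"; do
    # keep only basenames that were seen more than once
    if [ "${files[$name]}" -gt 1 ]; then
        printf '%s\t%s\n' "${files[$name]}" "$name"
    fi
done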

Related

Parameter expansion to remove strings with multiple patterns

I'm building a script that sends the "find" command output to a temp file and from there I use "read" to iterate through all the paths and print two fields into a csv file, one field for the name of the file and the other for the complete path.
find -type f \( -iname \*.mp4 -o -iname \*.mkv \) >> $tempfile
while read -r file; do
printf '%s\n' ${file##*/} ${file} | paste -sd ' ' >> $csvfile
done < $tempfile
rm $tempfile
The problem is in the field for the names, ${file##*/}. Some files have spaces in their names and this causes them not to be printed correctly in the csv file. I know I could use ${file//[[:blank:]]/} to remove the spaces, but I also need to preserve ${file##*/}, since that parameter expansion lets me cut everything but the name itself (and print that in the first field of the csv file).
I was searching for a way to somehow join the two parameter expansions ${file##*/} and ${file//[[:blank:]]/}, but I didn't find anything related. Is it possible to solve this using only parameter expansion? If not, what other solutions could fix this? Maybe regex?
Edit: I will also need to add a 3rd field whose value will depend on a variable.
If you're using GNU find (and possibly other implementations?) it can be simplified a lot:
find dir/ -type f \( -iname "*.mp4" -o -iname "*.mkv" \) \
-printf '"%f","'"${newvar//%/%%}"'","%p"\n' > "$csvfile"
I put quotes around the fields of the CSV output, to handle cases where the filenames might have commas in them. It'll still have an issue with filenames with doublequotes in the name, though.
If using some other version of find... well, there's no need for a temporary file. Just pipe the output directly to your while loop:
find test1/ -type f \( -iname "*.mp4" -o -iname "*.mkv" \) -print0 |
while IFS= read -d '' -r file; do
name=$(basename "$file")
printf '"%s","%s","%s"\n' "${name//\"/\"\"}" "$newvar" "${file//\"/\"\"}"
done > "$csvfile"
This one will escape double quotes appearing in the filename, so if that's the case with your files, prefer it.
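For illustration, with a hypothetical file test1/my "clip".mp4 and $newvar set to tag1, the resulting CSV line would look roughly like:
"my ""clip"".mp4","tag1","test1/my ""clip"".mp4"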

bash shell: recursively search folder and subfolders for files from list

Until now when I want to gather files from a list I have been using a list that contains full paths and using:
cat pathlist.txt | xargs -I % cp % folder
However, I would like to be able to recursively search through a folder and its subfolders and copy all files that are in a plain text list of just filenames (not full paths).
How would I go about doing this?
Thanks!
Assuming your list of file names contains bare file names, as would be suitable for passing as an argument to find -name, you can do just that.
sed 's/^/-name /;1!s/^/-o /' pathlist.txt |
xargs -I % find folders to search -type f \( % \) -exec cp -t folder {} +
If your cp doesn't support the -t option for specifying the destination folder before the sources, or your find doesn't support -exec ... \+ you will need to adapt this.
Just to explain what's going on here, the input
test.txt
radish.avi
:
is being interpolated into
find folders to search -type f \( -name test.txt -o -name radish.avi \
-o -name : \) -exec cp -t folder {} +
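If your cp lacks -t (common outside GNU coreutils), one adaptation, sketched here with the destination moved to the end and one cp invocation per file, would be:
find folders to search -type f \( -name test.txt -o -name radish.avi \) -exec cp {} folder \;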
Try something like
find folder_to_search -type f | grep -f pattern_file | xargs -I % cp % folder
Use the find command.
while read line
do
find /path/to/search/for -type f -name "$line" -exec cp -R {} /path/to/copy/to \;
done <plain_text_file_containing_file_names
Assumption:
The files in the list have standard names without, say, newlines or special characters in them.
Note:
If the files in the list have non-standard filenames, it will be a different ballgame. For more information see the find manpage and look for -print0. In short, you should be operating with null-terminated strings then.
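A rough null-safe variant of the loop above, assuming the list itself is stored null-delimited rather than newline-delimited (the file name null_delimited_file_list is made up):
# null_delimited_file_list: hypothetical list whose entries are separated by NUL bytes
while IFS= read -r -d '' name; do
    find /path/to/search/for -type f -name "$name" -exec cp -R {} /path/to/copy/to \;
done < null_delimited_file_list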

grep cannot read filename after find folders with spaces

Hi, after I find the files and enclose their names in double quotes with the following command:
FILES=$(find . -type f -not -path "./.git/*" -exec echo -n '"{}" ' \; | tr '\n' ' ')
I do a for loop to grep a certain word inside each file that matches find:
for f in $FILES; do grep -Eq '(GNU)' $f; done
but grep complains for each entry that it cannot find the file or directory:
grep: "./test/test.c": No such file or directory
whereas echo $FILES produces:
"./.DS_Store" "./.gitignore" "./add_license.sh" "./ads.add_lcs.log" "./lcs_gplv2" "./lcs_mit" "./LICENSE" "./new test/test.js" "./README.md" "./sxs.add_lcs.log" "./test/test.c" "./test/test.h" "./test/test.js" "./test/test.m" "./test/test.py" "./test/test.pyc"
EDIT: Found the answer here. Works perfectly!
The issue is that your $FILES variable contains filenames surrounded by literal " quotes.
But worse, find's -exec cmd {} \; executes cmd separately for each file, which can be inefficient. As mentioned by @TomFenech in the comments, you can use -exec cmd {} + to search as many files within a single cmd invocation as possible.
A better approach for recursive search is usually to let find output the filenames to search, and pipe its results to xargs in order to grep inside as many files per invocation as possible. Use -print0 and -0 respectively to correctly support filenames with spaces and other separators, by splitting results on a null character instead - this way you don't need quotes, reducing the possibility of bugs.
Something like this:
find . -type f -not -path './.git/*' -print0 | xargs -0 egrep '(GNU)'
However in your question you had grep -q in a loop, so I suspect you may be looking for an error status (found/not found) for each file? If so, you could use -l instead of -q to make grep list matching filenames, and then pipe/send that output to where you need the results.
find . -print0 | xargs -0 egrep -l pattern > matching_filenames
Also note that grep -E (or egrep) uses extended regular expressions, which means parentheses create a regex group. If you want to search for files containing (GNU) (with the parentheses) use grep -F or fgrep instead, which treats the pattern as a string literal.
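For example, to list the files that contain the literal string (GNU), parentheses included, something like this should do it:
find . -type f -not -path './.git/*' -print0 | xargs -0 grep -Fl '(GNU)'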

BASH: find and rename files & directories

I would like to replace :2f with a - in all file/dir names, and for some reason the one-liner below is not working. Is there any simpler way to achieve this?
Directory name example:
AN :2f EXAMPLE
Command:
for i in $(find /tmp/ \( -iname ".*" -prune -o -iname "*:*" -print \)); do { mv $i $(echo $i | sed 's/\:2f/\-/pg'); }; done
You don't have to parse the output of find:
find . -depth -name '*:2f*' -execdir bash -c 'echo mv "$0" "${0//:2f/-}"' {} \;
We're using -execdir so that the command is executed from within the directory containing the found file. We're also using -depth so that the content of a directory is considered before the directory itself. All this to avoid problems if the :2f string appears in a directory name.
As is, this command is harmless and won't perform any renaming; it'll only show on the terminal what's going to be performed. Remove echo if you're happy with what you see.
This assumes you want to perform the renaming for all files and folders (recursively) in current directory.
-execdir might not be available for your version of find, though.
If your find doesn't support -execdir, you can get along without as so:
find . -depth -name '*:2f*' -exec bash -c 'dn=${0%/*} bn=${0##*/}; echo mv "$dn/$bn" "$dn/${bn//:2f/-}"' {} \;
Here, the trick is to separate the directory part from the filename part—that's what we store in dn (dirname) and bn (basename)—and then only change the :2f in the filename.
Since you have filenames containing space, for will split these up into separate arguments when iterating. Pipe to a while loop instead:
find /tmp/ \( -iname ".*" -prune -o -iname "*:*" -print \) | while IFS= read -r i; do
    mv "$i" "$(echo "$i" | sed 's/:2f/-/g')"
done
Also quote all the variables and command substitutions.
This will work as long as you don't have any filenames containing newline.
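If newlines in the names are a real possibility, a null-delimited variant along these lines (GNU find and bash assumed) sidesteps that too:
find /tmp/ \( -iname ".*" -prune -o -iname "*:*" -print0 \) | while IFS= read -r -d '' i; do
    mv "$i" "${i//:2f/-}"
done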

What's a more concise way of finding text in a set of files?

I currently use the following command, but it's a little unwieldy to type. What's a shorter alternative?
find . -name '*.txt' -exec grep 'sometext' '{}' \; -print
Here are my requirements:
limit to a file extension (I use SVN and don't want to be searching through all those .svn directories)
can default to the current directory, but it's nice to be able to specify a different directory
must be recursive
UPDATE: Here's my best solution so far:
grep -r 'sometext' * --include='*.txt'
UPDATE #2: After using grep for a bit, I realized that I like the output of my first method better. So, I followed the suggestions of several responders and simply made a shell script and now I call that with two parameters (extension and text to find).
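For reference, a minimal version of such a wrapper (the script name findtxt is made up here) could be as small as:
#!/bin/bash
# findtxt: hypothetical wrapper; usage: findtxt <extension> <text> [dir]
ext=$1; text=$2; dir=${3:-.}
find "$dir" -name "*.${ext}" -exec grep "$text" '{}' \; -print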
grep has -r (recursive) and --include (to search only in files and directories matching a pattern).
If it's too unwieldy, write a script that does it and put it in your personal bin directory. I have a 'fif' script which searches source files for text, basically just doing a single find like you have here:
#!/bin/bash
set -f # disable pathname expansion
pattern="-iname *.[chsyl] -o -iname *.[ch]pp -o -iname *.hh -o -iname *.cc
         -o -iname *.java -o -iname *.inl"
prune=""
moreargs=true
while $moreargs && [ $# -gt 0 ]; do
    case $1 in
    -h)
        pattern="-iname *.h -o -iname *.hpp -o -iname *.hh"
        shift
        ;;
    -prune)
        prune="-name $2 -prune -false -o $prune"
        shift
        shift
        ;;
    *)
        moreargs=false
        ;;
    esac
done
find . $prune $pattern | sed 's/ /\\ /g' | xargs grep "$@"
It started life as a single-line script and got features added over the years as I needed them.
This is much more efficient since it invokes grep many fewer times, though it's hard to say it's more succinct:
find . -name '*.txt' -print0 | xargs -0 grep 'sometext' /dev/null
Notes:
find -print0 and xargs -0 make pathnames with embedded blanks work correctly.
The /dev/null argument makes sure grep always prepends a filename.
Install ack and use
ack -aG'\.txt$' 'sometext'
I second ephemient's suggestion of ack. I'm writing this post to highlight a particular issue.
In response to jgormley (in the comments): ack is available as a single file which will work wherever the right Perl version is installed (which is everywhere).
Given that on non-Linux platforms grep regularly does not accept -R, arguably using ack is more portable.
I use zsh, which has recursive globbing. If you needed to look at specific filetypes, the following would be equivalent to your example:
grep 'sometext' **/*.txt
If you don't care about the filetype, the -r option will be better:
grep -r 'sometext' *
That said, a minor tweak to your original example will give you exactly what you want:
find . -name '*.txt' \! -wholename '*/.svn/*' -exec grep 'sometext' '{}' \; -print
If this is something you do frequently, make it a function (put this in your shell config):
function grep_no_svn {
find . -name "${2:-*}" \! -wholename '*/.svn/*' -exec grep "$1" '{}' \; -print
}
Where the first argument to the function is the text you're searching for. So:
$ grep_no_svn "sometext"
Or:
$ grep_no_svn "sometext" "*.txt"
You could write a script (in bash or whatever -- I have one in Groovy) and place it on the path. E.g.
$ myFind.sh txt targetString
where myFind.sh is:
find . -name "*.$1" -exec grep "$2" {} \; -print
I usually avoid the "man find" by using grep 'sometext' $(find . -name "*.txt")
You say that you like the output of your method (using find) better. The only difference I can see between them is that grepping multiple files will put the filename on the front.
You can always (in GNU grep, but you must be using that or -r and --include wouldn't work) turn the filename off by using -h (--no-filename). The opposite, for anyone who does want filenames but has to use find for some other reason, is -H (--with-filename).
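For instance, with GNU grep this reproduces the recursive search without filename prefixes:
grep -rh --include='*.txt' 'sometext' .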
