Find duplicate folder names and ignore further subdirectory matches - bash

I have a data directory where duplicate folders have been created at various hierarchy levels. I want to find these duplicates using case-insensitive comparison. For every match I also want to ignore further subdirectory matches, to reduce the noise: in my use case directory names are quite unique, and a match indicates that all subdirectories are also likely to be duplicates (basically a given folder and all its subfolders have been copied to various locations). In other words, if two directories match but their parents, while being in different paths, also match, just return the parents.
For example, if I have the following folders:
a/b/Dup1
a/b/dup1/ignore1
c/dup1/ignore1
I want the output to be something like:
'dup1' found in:
a/b/Dup1
a/b/dup1
c/dup1
Any other output that lists all the case-insensitive duplicates and their relative or absolute path is also acceptable (or even just the name of the duplicate directory, since it should be easy to list its locations from there).

You can try combining the find, sort and uniq tools to get the desired result. This one-liner should work:
find yourstartingpath -type d -printf '%f\n' | sort -f | uniq -icd | while read c x; do echo "===== $x found $c times in ====="; find yourstartingpath -type d -iname "$x" -printf '%p\n'; done
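With the example layout from the question, the output should look roughly like this (the exact case of the reported name depends on which variant happens to sort first):
===== dup1 found 3 times in =====
yourstartingpath/a/b/Dup1
yourstartingpath/a/b/dup1
yourstartingpath/c/dup1
===== ignore1 found 2 times in =====
yourstartingpath/a/b/dup1/ignore1
yourstartingpath/c/dup1/ignore1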
This one-liner will not remove subdirectories from the search; I believe that would be hard to do with only the core utilities. If it is necessary, depending on your system it would probably be better to write a small program in Perl or Python. Alternatively, you may restrict your results to folders which have some common subdirectories, e.g. using the following one-liner:
find yourstartingpath -type d -printf '%P\n' | awk -F/ -v OFS='/' 'NF>=2 {print $(NF-1),$NF}' | sort -f | uniq -id | while read x; do echo "======= $x ======="; find yourstartingpath -type d -ipath "*$x"; done
but note that it will remove the Dup1 folder from your example (because it does not have an ignore1 subfolder).
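If you do want to suppress the nested matches with shell tools after all, one rough approach (a sketch only, assuming GNU find/sort/awk and directory paths without newlines) is to collect every directory whose name occurs more than once and then drop any collected path that has an ancestor in the collected set:
find yourstartingpath -type d -printf '%f\n' | sort -f | uniq -id |
while read -r name; do
    find yourstartingpath -type d -iname "$name"
done | sort -u |
awk '
    { all[$0] = 1; paths[NR] = $0 }
    END {
        for (i = 1; i <= NR; i++) {
            p = paths[i]; keep = 1; q = p
            # walk up through the ancestors of p; drop p if any ancestor
            # is itself one of the duplicated directories
            while (sub(/\/[^\/]*$/, "", q)) {
                if (q in all) { keep = 0; break }
            }
            if (keep) print p
        }
    }'
For the example above this prints a/b/Dup1, a/b/dup1 and c/dup1 (prefixed with the starting path) and drops the two ignore1 directories, at the cost of losing the "found N times" grouping.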

Related

Bash; check if a folder exist while the folders are numbered

I have a series of folders (A1, A2, ...) with some subfolders, but I only need to check the subfolders which follow the pattern 0$n_st*; I do not need to check the rest of the subfolders:
A1/
01_stjhs_lkk/
02_stlkd_ooe/
03_stoie_akwe/
...
A2/
01_stpw_awq/
02_stoe_iwoq/
03_stak_weri/
...
...
I want to find the subfolder with the largest number (0$n); the number of subfolders varies among the different folders. Then I want to go into that subfolder, grep something, and repeat the process for the other folders (A1, A2, ...). Here is my script, which does not work (the if condition seems to have some problem):
for dd in */; do
    cd "$dd" # A1, A2,...
    for num in {8,7,6,5,4,3,2,1}; do
        if [ -d 0${num}_st* ]; then
            echo "0${num}_st*"
            cd "0${num}_st*"
            echo $(grep -i 'xyz f' 04_*.log) #grep this line from log file
            break
        fi
        cd ..
    done
    cd ..
done
The immediate problem is that if [ -d ... ] will produce a syntax error if ... is a wildcard which matches more than one file. You can work around this in various ways, but probably the simplest which matches your (vague) requirements is something like
for dd in */; do
    for dir in "$dd"/0[0-9]_st*/; do
        : nothing
    done
    # the variable `$dir` will now contain the alphabetically last match
    grep -i 'xyz f' "$dir"/04_*.log
done
If the directory names contain different numbers of digits, sorting them alphabetically will not work (2 will come after 19), but your examples only show names with two digits in all cases, so let's assume that's representative.
Demo: https://ideone.com/1N2Iui
Here's a variation which exhibits a different way to find the biggest number by using sort -n and which thus should work for directories with variable numbers of digits, too.
for dd in */; do
    # note: no / after "$dd" here, so the 0N_st* component is field 2 when splitting on /
    biggest=$(printf '%s\n' "$dd"0*_st*/ | sort -t / -k2,2n | tail -n 1)
    grep -i 'xyz f' "$biggest"/04_*.log
done
Because the wildcards already end with /, the / after e.g. "$biggest" in the grep is strictly speaking redundant, but harmless. You can take it out (at the cost of some legibility) if it disturbs you (or you are on a weird system where double slashes have a special meaning to the file system).
Another suggestion, using the find command:
find . -type d -printf "%f\n" |grep "^0"|sort -n|tail -n1
Notice that this find command recursively scans all directories under the current directory.
In order to limit the find command to specific directories dir1 dir2, specify them before the -type option:
find dir1 dir2 -type d -printf "%f\n" |grep "^0"|sort -n|tail -n1
Explanation
find . -type d
Prints all the directories and subdirectories under the current directory.
find . -type d -printf "%f\n"
Prints only the directory name, not the directory path.
grep "^0"
Filter only directory names starting with 0
If this matches more directories than required, you can refine the grep filter, e.g. grep "^0[[:digit:]]\+_".
sort -n
Sort directory names numerically
tail -n1
Print the last directory
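Tying this back to the loop in the question, it could be used along these lines (a sketch; the */ glob, the 'xyz f' pattern and the 04_*.log name are taken from the question):
for dd in */; do
    # pick the highest-numbered 0N_st* subfolder of this folder, if any
    last=$(find "$dd" -maxdepth 1 -type d -printf '%f\n' | grep '^0' | sort -n | tail -n1)
    [ -n "$last" ] && grep -i 'xyz f' "$dd$last"/04_*.log
done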

Given a text file with file names, how can I find files in subdirectories of the current directory?

I have a bunch of files with different names in different subdirectories. I created a txt file with those names, but I cannot make find work using the file. I have seen posts about problems creating the list, and about not using find (though I do not understand the reason). Suggestions? It is difficult for me to come up with an example because I do not know how to reproduce the directory structure.
The following are the names of the files (just in case there is a formatting problem)
AO-169
AO-170
AO-171
The best that I came up with is:
cat ExtendedList.txt | xargs -I {} find . -name {}
It obviously dies in the first directory that it finds.
I also tried
ta="AO-169 AO-170 AO-171"
find . -name $ta
but it complains find: AO-170: unknown primary or operator
If you are trying to ask "how can I find files with any of these names in subdirectories of the current directory", the answer to that would look something like
xargs printf -- '-o\0-name\0%s\0' <ExtendedList.txt |
xargs -r0 find . -false
The -false is just a cute way to let the list of actual predicates start with "... or".
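With the three names from the question, the command that eventually gets run is effectively:
find . -false -o -name AO-169 -o -name AO-170 -o -name AO-171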
If the list of names in ExtendedList.txt is large, this could fail if the second xargs decides to break it up between -o and -name.
The option -0 is not portable, but should work e.g. on Linux or wherever you have GNU xargs.
If you can guarantee that the list of strings in ExtendedList.txt does not contain any characters which are problematic to the shell (like single quotes), you could simply say
sed "s/.*/-o -name '&'/" ExtendedList.txt |
xargs -r find . -false

How do I filter down a subset of files based upon time?

Let's assume I have done lots of work whittling down a list of files in a directory down to the 10 files that I am interested in. There were hundreds of files, and I have finally found the ones I need.
I can either pipe out the results of this (piping from ls), or I can say I have an array of those values (doing this inside a script). Doesn't matter either way.
Now, of that list, I want to find only the files that were created yesterday.
We can use tools like find -mtime 1 which are fine. But how would we do that with a subset of files in a directory? Can we pass a subset to find via xargs?
I can do this pretty easily with a for loop. But I was curious if you smart people knew of a one-liner.
If they're in an array:
files=(...)
find "${files[#]}" -mtime 1
If they're being piped in:
... | xargs -d'\n' -I{} find {} -mtime 1
Note that the second one will run a separate find command for each file which is a bit inefficient.
If any of the items are directories and you don't want to search inside of them, add -maxdepth 0 to disable find's recursion.
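If the per-file find invocations bother you, one possible workaround (a sketch, building on the pipe form above) is a small sh -c wrapper, so that xargs can append many paths to a single find call while keeping them ahead of the -mtime predicate:
... | xargs -d'\n' sh -c 'find "$@" -mtime 1' sh
The same trick accepts -maxdepth 0 inside the quotes if you need it.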
Another option that won't recurse, though I'd just use John's find solution if I were you.
$: stat -c "%n %w" "${files[@]}" | sed -n "
/ $(date +'%Y-%m-%d' --date=yesterday) /{ s/ .*//; p; }"
The stat will print the name and creation date of files in the array.
The sed "greps" for the date you want and strips the date info before printing the filename.

Find files in current directory, list differences from list within script

I am attempting to find differences for a directory and a list of files located in the bash script, for portability.
For example, search a directory with phpBB installed. Compare recursive directory listing to list of core installation files (excluding themes, uploads, etc). Display additional and missing files.
Thus far, I have attempted using diff, comm, and tr, with "argument too long" errors. This is likely because the lists are lists of file names, so the commands end up trying to compare the actual files rather than the lists themselves.
The file list in the script looks something like this (But I am willing to format differently):
./file.php
./file2.php
./dir/file.php
./dir/.file2.php
I am attempting to use one of the following to print the list:
find ./ -type f -printf "%P\n"
or
find ./ -type f -print
Then use any command you can think of to compare the results to the list of files inside the script.
The following are difficult to use, as there are often thousands of files to check, each version can change the listings, and it is a pain to update the whole script every time there is a new release.
find . ! -wholename './file.php' ! -wholename './file2.php'
find . ! -name './file.php' ! -name './file2.php'
find . ! -path './file.php' ! -path './file2.php'
With the lists being in different orders to accommodate any additional files, it can't be a straight comparison.
I'm just stumped. I greatly appreciate any advice or if I could be pointed in the right direction. Ask away for clarification!
You can use the -r option of diff command, to recursively compare the contents of the two directories. This way you don't need all the file names on the command line; just the two top level directory names.
It will give you missing files, newly added files, and the difference of changed files. Many things can be controlled by different options.
If you mean you have a list of expected files somewhere, and only one directory to be compared against it, then you can try using the tree command: create the list once with tree, and at comparison time run tree again on the directory and compare its output with the stored "expected output" using diff.
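For instance, something along these lines (a sketch; the paths and the expected.txt name are placeholders):
# Record the expected listing once (-a: include dotfiles, -i: no indentation
# lines, -f: print paths relative to ".", --noreport: drop the summary line):
( cd path/to/reference/copy && tree -aif --noreport . ) > expected.txt
# Later, compare the directory being checked against the stored listing:
( cd path/to/your/directory && tree -aif --noreport . ) | diff expected.txt -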
Do you have to use coreutils? If so:
Put your list in a file, say list.txt, with one file path per line.
comm -23 <(find path/to/your/directory -type f | sort) \
         <(sort path/to/list.txt) \
         > diff.txt
diff.txt will have one line per file in path/to/your/directory that is not in your list.
If you care about files in your list that are not in path/to/your/directory, do comm -13 with the same parameters.
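That is, roughly (writing to missing.txt here only to keep the two outputs apart):
comm -13 <(find path/to/your/directory -type f | sort) \
         <(sort path/to/list.txt) \
         > missing.txt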
Otherwise, you can also use sd (stream diff), which doesn't require sorting nor process substitution and supports infinite streams, like so:
find path/to/your/directory -type f | sd 'cat path/to/list.txt' > diff.txt
And just invert the streams to get the second requirement, like so:
cat path/to/list.txt | sd 'find path/to/your/directory -type f' > diff.txt
Probably not that much of a benefit in this example other than succinctness, but still consider it; in some cases you won't be able to use comm, grep -F or diff.
Here's a blogpost I wrote about diffing streams on the terminal, which introduces sd.

Recursively find all files that match a certain pattern

I need to find (or more specifically, count) all files that match this pattern:
*/foo/*.doc
Where the first wildcard asterisk includes a variable number of subdirectories.
With GNU find you can use -regex, which (unlike -name) matches against the entire path:
find . -regex '.*/foo/[^/]*\.doc'
To just count the number of files:
find . -regex '.*/foo/[^/]*\.doc' -printf '%i\n' | wc -l
(The %i format code causes find to print the inode number instead of the filename; unlike the filename, the inode number is guaranteed not to contain characters like a newline, so counting is more reliable. Thanks to @tripleee for the suggestion.)
I don't know if that will work on OSX, though.
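A possibly more portable alternative (a sketch) is -path, which POSIX specifies for find. The caveat is that * in -path also matches /, so this would additionally count .doc files in subdirectories below foo:
find . -type f -path '*/foo/*.doc' | wc -l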
how about:
find BASE_OF_SEARCH/*/foo -name \*.doc -type f | wc -l
What this is doing:
start at directory BASE_OF_SEARCH/
look in all directories that have a directory foo
look for files named like *.doc
count the lines of the result (one per file)
The benefit of this method:
neither recursive nor iterative (no loops)
it's easy to read, and if you include it in a script it's fairly easy to decipher (regex sometimes is not).
UPDATE: you want variable depth? ok:
find BASE_OF_SEARCH -name \*.doc -type f | grep foo | wc -l
start at directory BASE_OF_SEARCH
look for files named like *.doc
only show the lines of this result that include "foo"
count the lines of the result (one per file)
Optionally, you may want to filter out results that merely have "foo" in the filename rather than in a directory, because the grep will match those too.
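For example (a sketch), requiring foo to appear as a path component rather than anywhere in the line:
find BASE_OF_SEARCH -name \*.doc -type f | grep '/foo/' | wc -l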
Based on the answers on this page and on other pages, I managed to put together the following, which searches the current folder and everything under it for all files with the pdf extension, and then filters for those that contain test_text in their name.
find . -name "*.pdf" | grep test_text | wc -l
Untested, but try:
find . -type d -name foo -print | while read -r d; do printf '%s\n' "$d"/*.doc; done | wc -l
find all the "foo" directories (at varying depths); this ignores symlinks, but if that's part of the problem you can add them. Then use shell globbing to find all the ".doc" files, print them one per line, and count them.
