Recursively find all files that match a certain pattern - macOS

I need to find (or more specifically, count) all files that match this pattern:
*/foo/*.doc
Where the first wildcard asterisk includes a variable number of subdirectories.

With GNU find you can use -regex, which (unlike -name) matches the entire path:
find . -regex '.*/foo/[^/]*\.doc'
To just count the number of files:
find . -regex '.*/foo/[^/]*\.doc' -printf '%i\n' | wc -l
(The %i format code causes find to print the inode number instead of the filename; unlike the filename, the inode number is guaranteed not to contain characters like newlines, so counting lines is more reliable. Thanks to @tripleee for the suggestion.)
I don't know if that will work on OSX, though.
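For what it's worth, the BSD find that ships with macOS lacks GNU extensions such as -printf, but it does have a -regex primary (which likewise matches the whole path) and an -E option to switch to extended regular expressions. A sketch of the equivalent on a stock macOS find, counting plain output lines (so filenames containing newlines would still throw the count off):
find -E . -regex '.*/foo/[^/]*\.doc' | wc -l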

how about:
find BASE_OF_SEARCH/*/foo -name \*.doc -type f | wc -l
What this is doing:
start at each directory matching BASE_OF_SEARCH/*/foo (i.e. every foo directory exactly one level below BASE_OF_SEARCH)
look for files named like *.doc anywhere beneath those foo directories
count the lines of the result (one per file)
The benefit of this method:
no explicit recursion or loops to write (find does the directory walking for you)
it's easy to read, and if you include it in a script it's fairly easy to decipher (regex sometimes is not).
UPDATE: you want variable depth? ok:
find BASE_OF_SEARCH -name \*.doc -type f | grep foo | wc -l
start at directory BASE_OF_SEARCH
look for files named like *.doc
only show the lines of this result that include "foo"
count the lines of the result (one per file)
Optionally, you could filter out results that merely have "foo" somewhere in the filename rather than in a directory component, because the grep will count those too; one way to do that is sketched below.
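A sketch of that filtering, using only the -path test (supported by both GNU and BSD/macOS find) so that a stray "foo" in a filename is not counted, only files somewhere under a directory named foo:
find BASE_OF_SEARCH -type f -name '*.doc' -path '*/foo/*' | wc -l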

Based on the answers on this page and on other pages, I managed to put together the following, where a search is performed in the current folder and all folders under it for all files that have the extension pdf, followed by filtering for those that contain test_text in their name.
find . -name "*.pdf" | grep test_text | wc -l

Untested, but try:
find . -type d -name foo -print | while read -r d; do printf '%s\n' "$d"/*.doc; done | wc -l
find all the "foo" directories (at varying depths) (this ignores symlinks; if that's part of the problem you can add them); use shell globbing to list the ".doc" files inside each one (the glob must sit outside the quotes so it actually expands, and printf prints one file per line), then count them.
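One caveat with this untested pipeline: if a foo directory contains no .doc files at all, the unexpanded *.doc pattern (or an empty printf line) would still be counted. A rough sketch that guards against that with bash's nullglob option (assuming bash is the shell running the loop):
find . -type d -name foo -print | while read -r d; do
    shopt -s nullglob
    files=( "$d"/*.doc )          # empty array if no .doc files in this foo directory
    (( ${#files[@]} )) && printf '%s\n' "${files[@]}"
done | wc -l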

Related

Bash: check if a folder exists when the folders are numbered

I have a series of folders (A1, A2, ...) and some subfolders, but I only need to check the subfolders which follow this pattern 0$n_st* and I do not need to check the rest of the subfolders:
A1/
  01_stjhs_lkk/
  02_stlkd_ooe/
  03_stoie_akwe/
  ...
A2/
  01_stpw_awq/
  02_stoe_iwoq/
  03_stak_weri/
  ...
...
I want to find the subfolder with the largest number (0$n); the number of subfolders varies between folders. Then I want to go into that subfolder, grep something, and repeat the process for the other folders (A1, A2, ...). Here is my script, which does not work (the if condition seems to be the problem):
for dd in */; do
    cd "$dd" # A1, A2,...
    for num in {8,7,6,5,4,3,2,1}; do
        if [ -d 0${num}_st* ]; then
            echo "0${num}_st*"
            cd "0${num}_st*"
            echo $(grep -i 'xyz f' 04_*.log) #grep this line from log file
            break
        fi
        cd ..
    done
    cd ..
done
The immediate problem is that if [ -d ... ] will produce a syntax error if ... is a wildcard which matches more than one file. You can work around this in various ways, but probably the simplest which matches your (vague) requirements is something like
for dd in */; do
    for dir in "$dd"/0[0-9]_st*/; do
        : nothing
    done
    # the variable `$dir` will now contain the alphabetically last match
    grep -i 'xyz f' "$dir"/04_*.log
done
If the directories contain different numbers of digits in their names, sorting them alphabetically will not work (2 will come after 19) but your examples only show names with two digits in all cases so let's assume that's representative.
Demo: https://ideone.com/1N2Iui
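A quick way to see the alphabetical-ordering caveat in action, with illustrative names of differing widths:
printf '%s\n' 02_sta 19_stb 2_stc | sort
# prints 02_sta, 19_stb, 2_stc -- "2" ends up after "19" because the comparison is character by character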
Here's a variation which uses a numeric sort to find the biggest number, and which should therefore also work for directories with variable numbers of digits in their names.
for dd in */; do
    biggest=$(printf '%s\n' "$dd"/0*_st*/ | sort -t / -k3,3n | tail -n 1)
    grep -i 'xyz f' "$biggest"/04_*.log
done
Because the wildcards already end with /, the / after e.g. "$dd"/ is strictly speaking redundant, but harmless for the globbing (unless you are on a weird system where double slashes have a special meaning to the file system). It does insert an empty field as far as sort is concerned, which is why the sort key above is -k3,3n; if you take the extra slash out (at the cost of some legibility), change the key to -k2,2n.
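If some of the top-level folders might contain no matching 0*_st*/ subdirectory at all, the unexpanded glob would be passed to sort and grep literally. A sketch that guards against that with nullglob (written without the redundant slash, so the directory name is field 2 for sort):
shopt -s nullglob
for dd in */; do
    matches=( "$dd"0*_st*/ )
    (( ${#matches[@]} )) || continue      # skip folders with no numbered subfolder
    biggest=$(printf '%s\n' "${matches[@]}" | sort -t / -k2,2n | tail -n 1)
    # assumes the chosen subfolder contains a 04_*.log file, as in the question
    grep -i 'xyz f' "$biggest"04_*.log
done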
Another suggestion, using a find command.
find . -type d -printf "%f\n" | grep "^0" | sort -n | tail -n1
Notice that this find command recursively scans all directories under the current directory.
In order to limit the find command to specific directories dir1 dir2, specify them before the -type option:
find dir1 dir2 -type d -printf "%f\n" | grep "^0" | sort -n | tail -n1
Explanation
find . -type d
Prints all the directories and subdirectories under the current directory.
find . -type d -printf "%f\n"
Prints only the directory name, not the full directory path.
grep "^0"
Keeps only directory names starting with 0.
If this matches more directories than required, refine the filter, e.g. grep "^0[[:digit:]]\+_".
sort -n
Sorts the directory names numerically.
tail -n1
Prints the last (largest-numbered) directory name.
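A sketch of how this could be plugged into the original per-folder loop (untested; it assumes GNU find for -printf and that A1, A2, ... are the entries of the current directory):
for dd in */; do
    biggest=$(find "$dd" -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | grep '^0' | sort -n | tail -n1)
    [ -n "$biggest" ] && grep -i 'xyz f' "$dd$biggest"/04_*.log
done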

Find duplicate folder names and ignore further subdirectory matches

I have a data directory where duplicate folders have been created at various hierarchy levels. I want to find these duplicates using case-insensitive comparison. I also want to ignore further subdirectory matches for every match, to reduce the noise: in my use case directory names are quite unique, and a match indicates that all subdirectories are also likely to be duplicates (basically, a given folder and all its subfolders have been copied to various locations). In other words, if two directories match and their parents, while being in different paths, also match, just return the parents.
For example, if I have the following folders:
a/b/Dup1
a/b/dup1/ignore1
c/dup1/ignore1
I want the output to be something like:
'dup1' found in:
a/b/Dup1
a/b/dup1
c/dup1
Any other output that lists all the case-insensitive duplicates and their relative or absolute path is also acceptable (or even just the name of the duplicate directory, since it should be easy to list its locations from there).
You can try combining the find, sort and uniq tools to get the desired result. This one-liner should work:
find yourstartingpath -type d -printf '%f\n' | sort -f | uniq -icd | while read -r c x; do echo "===== $x found $c times in ====="; find yourstartingpath -type d -iname "$x" -printf '%p\n'; done
(sort -f folds case so that uniq -i sees the case-insensitive duplicates as adjacent lines.)
This one-liner will not remove subdirectories from the search, but I believe it would be hard to do so with only bash core utilities. If that is necessary, depending on your system it would probably be better to write a program in Perl or Python to get what you need. Alternatively, you may restrict your results to folders which share a common subdirectory, e.g. using the following one-liner:
find yourstartingpath -type d -printf '%P\n' | awk -F/ -v OFS='/' 'NF>=4 {print $(NF-1),$NF}' | sort -f | uniq -id | while read -r x; do echo "======= $x "; find yourstartingpath -type d -ipath "*$x"; done
but it will drop the Dup1 folder from your example (because it does not have an ignore1 subfolder).
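If suppressing nested matches really is needed, here is a rough shell-only sketch of one way to do it (yourstartingpath and the /tmp/dupdirs.txt file name are placeholders): first collect every directory whose name is a case-insensitive duplicate, then drop any entry whose immediate parent directory is itself in that list, which is enough for the example above.
find yourstartingpath -type d -printf '%f\n' | sort -f | uniq -icd |
    while read -r c x; do find yourstartingpath -type d -iname "$x"; done | sort > /tmp/dupdirs.txt

# keep only matches whose parent directory is not itself a reported duplicate
while read -r p; do
    grep -qxF "$(dirname "$p")" /tmp/dupdirs.txt || echo "$p"
done < /tmp/dupdirs.txt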

find/grep to list a specific file that contains a specific string

I have a root directory that I need to run a find and/or grep command on to return a list of files that contain a specific string.
Here's an example of the file and directory set up. In reality, this root directory contains a lot of subdirectories that each have a lot of subdirectories and files, but this example, I hope, gets my point across.
From root, I need to go through each of the children directories, specifically into subdir/ and look through file.html for the string "example:". If a result is found, I'd like it to print out the full path to file.html, such as website_two/subdir/file.html.
I figured limiting the search to subdir/file.html will greatly increase the speed of this operation.
I'm not too knowledgeable about find and grep, but I have tried the following with no luck, and I honestly don't know how to troubleshoot it.
find . -name "file.html" -exec grep -HI "example:" {} \;
EDIT: I understand this may be marked as a duplicate, but I think my question is more along the lines of how can I tell the command to only search a specific file in a specific path, looping through all root-> level directories.
find ./ -type f -iname file.html -exec grep -l "example:" {} +
or
grep -Rl "example:" ./ | grep -iE "file.htm(l)*$" will do the trick.
Quote from GNU Grep 2.25 man page:
-R, --dereference-recursive
Read all files under each directory, recursively. Follow all symbolic links, unlike -r.
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have
been printed. The scanning will stop on the first match.
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files.
-E, --extended-regexp
Interpret PATTERN as an extended regular expression.
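Since the question notes that the file always lives at subdir/file.html under each top-level directory, the search can be narrowed further with find's -path test (supported by both GNU and BSD find). A sketch, assuming that layout:
find . -path '*/subdir/file.html' -type f -exec grep -l "example:" {} +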

grep for string in all files in directories with certain names

I have csv files in directories with this structure:
20090120/
20090121/
20090122/
etc...
I want to grep for a certain string in all of the csv files in these directories, but only for January 2009, e.g. 200901*/*.csv
Is there a bash command line argument that can do this?
Yes. grep "a certain string" 200901*/*.csv.
You need something like:
grep NEEDLE 200901*/*.csv
(assuming your search string is NEEDLE of course - just change it to whatever you're actually looking for).
The bash shell is quite capable of expanding multi-level paths and file names.
That is, of course, limited to the CSV files one directory down. If you want to search entire subtrees, you'll have to use the slightly more complicated (and adaptable) find command.
Though, assuming you can set a limit on the depth, you could get away with something like (for three levels):
grep NEEDLE 200901*/*.csv 200901*/*/*.csv 200901*/*/*/*.csv
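For arbitrary depth, the find version of the same search might look like this (NEEDLE again stands in for the real search string):
find 200901*/ -type f -name '*.csv' -exec grep -l NEEDLE {} +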
Try this:
grep -lr "Pattern" 200901*/*.csv
Try a combination of find to search for specific filename patterns and grep for finding the pattern:
find . -name "*.csv" -print -exec grep -n "NEEDLE" {} \; | grep -B1 "NEEDLE"

How can I find and count number of files matching a given string?

I want to find and count all the files on my system that begin with some string, say "foo", using only one line in bash.
I'm new to bash so I'd like to avoid scripting if possible - how can I do this using only simple bash commands and maybe piping in just one line?
So far I've been using find / -name foo*. This returns the list of files, but I don't know what to add to actually count the files.
You can use
find / -type f -name 'foo*' | wc -l
Use the single-quotes to prevent the shell from expanding the asterisk.
Use -type f to include only files (not links or directories).
wc -l means "word count, lines only." Since find will list one file per line, this returns the number of files it found.
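If there is any chance of filenames containing newline characters, a more robust count can be had by counting NUL-terminated names instead of lines; a sketch using find's -print0 (available in both GNU and BSD find):
find / -type f -name 'foo*' -print0 | tr -cd '\0' | wc -c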
find / -name 'foo*' | wc -l should do it; wc -l counts the number of lines.
You can pipe it into wc
find / -name 'foo*' | wc -l
