Optimal way to recursively find files that match one or more patterns - bash

I have to optimize a shell script, but after one week I haven't managed to make it fast enough.
I have to search recursively for .c, .h, and .cpp files in a directory, and check whether words like these occur in them:
"float short unsigned continue for signed void default goto sizeof volatile do if static while"
words=$(echo "$*" | sed 's/ /\\|/g')
files=$(find $dir -name '*.cpp' -o -name '*.c' -o -name '*.h' )
for file in $files; do
(
test=$(grep -woh "$words" "$file" | sort -u | awk '{print}' ORS=' ')
if [ "$test" != "" ] ; then
echo "$(realpath $file) contains : $test"
fi
)&
done
wait
I have tried with xargs and -exec, but without success. I have to keep this result format:
/usr/include/c++/6/bits/stl_set.h contains : default for if void
Maybe you can help me to optimize it.
EDIT: I have to keep only one occurrence of each word:
YES: while, for, volatile...
NOPE: while, for, for, volatile...

If you are interested in finding all files that have at least one match of any of your patterns, you can use globstar:
shopt -s globstar
oldIFS=$IFS; IFS='|'; patterns="$*"; IFS=$oldIFS # make a | delimited string from arguments
grep -lwE "$patterns" **/*.c **/*.h **/*.cpp # list files with matching patterns
globstar
If set, the pattern ‘**’ used in a filename expansion context
will match all files and zero or more directories and subdirectories.
If the pattern is followed by a ‘/’, only directories and
subdirectories match.
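A hypothetical way to use this, assuming the snippet above is saved as a script named findwords.sh that receives the keywords as its arguments:
#!/bin/bash
# findwords.sh (hypothetical name): list files containing any of the given words
shopt -s globstar
oldIFS=$IFS; IFS='|'; patterns="$*"; IFS=$oldIFS   # make a | delimited string from arguments
grep -lwE "$patterns" **/*.c **/*.h **/*.cpp       # -l prints only the names of matching files
It would be called, for example, as ./findwords.sh float short unsigned continue for signed void default goto sizeof volatile do if static while. Note that this lists the matching files only; it does not show which of the words were found in each file.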

Here is an approach that keeps your desired format while eliminating the use of find and bash looping:
words='float|short|unsigned|continue|for|signed|void|default|goto|sizeof|volatile|do|if|static|while'
grep -rwoE --include '*.[ch]' --include '*.cpp' "$words" path | awk -F: '$1!=last{printf "%s%s contains : %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]} $1==last && !($2 in a){printf " %s",$2; a[$2]} END{print""}'
How it works
grep -rwoE --include '*.[ch]' --include '*.cpp' "$words" path
This searches recursively through directories starting at path, looking only at files whose names match the globs *.[ch] or *.cpp.
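The output of this grep stage is one file:word pair per line. For example (hypothetical paths), the stream handed to awk might look like:
src/util.c:if
src/util.c:static
src/util.c:if
include/util.h:void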
awk -F: '$1!=last{printf "%s%s contains : %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]} $1==last && !($2 in a){printf " %s",$2; a[$2]} END{print""}'
This awk command reformats the output of grep to match your desired output. The script uses a variable last and array a. last keeps track of which file we are on and a contains the list of words seen so far. In more detail:
-F:
This tells awk to use : as the field separator. In this way, the first field is the file name and the second is the word that is found. (limitation: file names that include : are not supported.)
'$1!=last{printf "%s%s contains : %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]}
Every time the file name, $1, does not match the variable last, we start the output line for a new file. Then we update last to hold the name of this new file, clear array a, and add $2 as its first key.
$1==last && !($2 in a){printf " %s",$2; a[$2]}
If the current file name is the same as the previous and the current word has not been seen before, we print out the new word found. We also add this word, $2 as a key to array a.
END{print""}
This prints out a final newline (record separator) character.
Multiline version of code
For those who prefer their code spread out over multiple lines:
grep -rwoE \
    --include '*.[ch]' \
    --include '*.cpp' \
    "$words" path |
awk -F: '
    $1!=last{
        printf "%s%s contains : %s",r,$1,$2
        last=$1
        r=ORS
        delete a
        a[$2]
    }
    $1==last && !($2 in a){
        printf " %s",$2
        a[$2]
    }
    END{
        print""
    }'

You should be able to do most of this with a single grep command:
grep -Rw "$dir" --include \*.c --include \*.h --include \*.cpp -oe "$words"
This will put it in file:word format, so all that's left is to change it to produce the output that you want.
echo "$output" | sed 's/:/ /g' | awk '{print $1 " contains : " $2}'
Then you can add | sort -u to get only single occurrences for each word in each file.
#!/bin/bash
#dir=.
words=$(echo "$*" | sed 's/ /\\|/g')
grep -Rw "$dir" --include \*.c --include \*.h --include \*.cpp -oe "$words" \
| sort -u \
| sed 's/:/ /g' \
| awk '{print $1 " contains : " $2}'

Related

In loop cat file - echo name of file - count

I am trying to make a one-line command that does the following:
The folder "data" has 570 files; each file has some lines of text, and the files are named 1.txt to 570.txt.
I want to cat each file, grep for a word, and count how many times that word occurs.
For the moment I am trying to get this using for:
for FILES in $(find /home/my/data/ -type f -print -exec cat {} \;) ; do echo $FILES; cat $FILES |grep word ; done |wc -l
but if I do that, it counts correctly but does not display which file each count belongs to.
I would like it to look like this:
----> 1.txt <----
210
---> 2.txt <----
15
etc, etc, etc..
How do I get that?
grep -o word * | uniq -c
is practically all you need.
grep -o word * gives a line for each hit, but only prints the match, in this case "word". Each line is prefixed with the filename it was found in.
uniq -c then collapses those identical adjacent lines, effectively giving one line per file, prefixed with the count.
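For example, with hypothetical files 1.txt and 2.txt, grep -o word * might emit:
1.txt:word
1.txt:word
2.txt:word
and uniq -c collapses that to:
      2 1.txt:word
      1 2.txt:word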
You can further format it to your needs with awk or whatever, though, for example like this:
grep -o word * | uniq -c | cut -f1 -d':' | awk '{print "File: " $2 " Count: " $1}'
You can try this:
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -c "word_to_count" "$file" ; done
The for loop will iterate over the files inside the folder "data".
For each of these files, it prints the name and then the number of occurrences of "word_to_count" (grep -c directly outputs a count of matching lines).
Be careful: if your search word appears more than once on a line, this solution will still count that line only once.
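If you need every occurrence counted rather than just matching lines, a sketch of one way to do it (assuming GNU grep; "word_to_count" is the placeholder from above):
for file in /path/to/folder/data/* ; do
    echo "----> $file <----"
    # grep -o prints each match on its own line, so counting lines counts occurrences
    grep -o "word_to_count" "$file" | wc -l
done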
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). See comments for more background. One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
Or, summing numbers from standard input with Python:
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
If you only want the total number of lines, you could use
find /home/my/data/ -type f -exec cat {} + | wc -l

Append wc lines to filename

Title says it all. I've managed to get just the lines with this:
lines=$(wc file.txt | awk {'print $1'});
But I could use an assist appending this to the filename. Bonus points for showing me how to loop this over all the .txt files in the current directory.
find -name '*.txt' -execdir bash -c \
'mv -v "$0" "${0%.txt}_$(wc -l < "$0").txt"' {} \;
where
the bash command is executed for each (\;) matched file;
{} is replaced by the currently processed filename and passed as the first argument ($0) to the script;
${0%.txt} deletes shortest match of .txt from back of the string (see the official Bash-scripting guide);
wc -l < "$0" prints only the number of lines in the file (see answers to this question, for example)
Sample output:
'./file-a.txt' -> 'file-a_5.txt'
'./file with spaces.txt' -> 'file with spaces_8.txt'
You could use the rename command, which is actually a Perl script, as follows:
rename --dry-run 'my $fn=$_; open my $fh,"<$_"; while(<$fh>){}; $_=$fn; s/.txt$/-$..txt/' *txt
Sample Output
'tight_layout1.txt' would be renamed to 'tight_layout1-519.txt'
'tight_layout2.txt' would be renamed to 'tight_layout2-1122.txt'
'tight_layout3.txt' would be renamed to 'tight_layout3-921.txt'
'tight_layout4.txt' would be renamed to 'tight_layout4-1122.txt'
If you like what it says, remove the --dry-run and run again.
The script counts the lines in the file without using any external processes and then renames them as you ask, also without using any external processes, so it is quite efficient.
Or, if you are happy to invoke an external process to count the lines, and avoid the Perl method above:
rename --dry-run 's/\.txt$/-`grep -ch "^" "$_"` . ".txt"/e' *txt
Use the rename command:
for file in *.txt; do
lines=$(wc ${file} | awk {'print $1'});
rename s/$/${lines}/ ${file}
done
#!/bin/bash
files=$(find . -maxdepth 1 -type f -name '*.txt' -printf '%f\n')
for file in $files; do
lines=$(wc $file | awk {'print $1'});
extension="${file##*.}"
filename="${file%.*}"
mv "$file" "${filename}${lines}.${extension}"
done
You can adjust maxdepth accordingly.
You can do it like this as well:
for file in "path_to_file"/'your_filename_pattern'
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
example:
for file in /oradata/SCRIPTS_EL/text*
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
This would work, but there are definitely more elegant ways.
for i in *.txt; do
mv "$i" ${i/.txt/}_$(wc $i | awk {'print $1'})_.txt;
done
The result puts the line count nicely before the .txt.
Like:
file1_1_.txt
file2_25_.txt
You could use grep -c '^' to get the number of lines, instead of wc and awk:
for file in *.txt; do
[[ ! -f $file ]] && continue # skip over entries that are not regular files
#
# move file.txt to file.txt.N where N is the number of lines in file
#
# this naming convention has the advantage that if we run the loop again,
# we will not reprocess the files which were processed earlier
mv "$file" "$file".$(grep -c '^' "$file")
done
{ linecount[FILENAME] = FNR }
END {
linecount[FILENAME] = FNR
for (file in linecount) {
newname = gensub(/\.[^\.]*$/, "-"linecount[file]"&", 1, file)
q = "'"; qq = "'\"'\"'"; gsub(q, qq, newname)
print "mv -i -v '" gensub(q, qq, "g", file) "' '" newname "'"
}
close(c)
}
Save the above awk script in a file, say wcmv.awk, then run it like:
awk -f wcmv.awk *.txt
It will list the commands that need to be run to rename the files in the required way (except that it will ignore empty files). To actually execute them, pipe the output to a shell as follows.
awk -f wcmv.awk *.txt | sh
Like it goes with all irreversible batch operations, be careful and execute commands only if they look okay.
awk '
BEGIN{ for ( i=1;i<ARGC;i++ ) Files[ARGV[i]]=0 }
{Files[FILENAME]++}
END{for (file in Files) {
# if( file !~ "_" Files[file] ".txt$") {
fileF=file;gsub( /\047/, "\047\"\047\"\047", fileF)
fileT=fileF;sub( /.txt$/, "_" Files[file] ".txt", fileT)
system( sprintf( "mv \047%s\047 \047%s\047", fileF, fileT))
# }
}
}' *.txt
Another way with awk, which makes a second run easier by allowing more control over the name (for example, avoiding files that already have the count in their name from a previous cycle).
Following a good remark from @gniourf_gniourf:
file names with spaces inside are now handled
the once-tiny code has become heavy for such a small task

Bash: grabbing the second line and last line of output (ls -lrS) only

I am looking to get the second line and the last line of what the ls -lrS command outputs. I've been using ls -lrS | (head -2 | tail -1) && (tail -n1), but it seems to get only the first line, and I have to press Ctrl-C to stop it.
Another problem I am having is with the awk command; I want to grab just the file size and file name. If I were to get the correct lines (second and last), my command would be
files=$(ls -lrS | (head -2 | tail -1) && (tail -n1) awk '{ print "%s", $5; "%s", $8; }' )
I was hoping it would print:
1234 file.abc
12345 file2.abc
Using the format stable GNU stat command:
stat --format='%s %n' * | sort -n | sed -n '1p;$p'
If you're using BSD stat, adjust accordingly.
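A sketch of the BSD/macOS equivalent (an assumption; check your stat(1) man page for the exact format codes):
# BSD stat: -f takes the format string, %z is the size in bytes, %N is the file name
stat -f '%z %N' * | sort -n | sed -n '1p;$p'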
If you want a lot more control over what files go into this calculation, and arguably better portability, use find. In this example, I'm getting all non-dot files in the current directory:
find -maxdepth 1 -not -path '*/\.*' -printf '%s %p\n' | sort -n | sed -n '1p;$p'
And take care if your directory contains two or fewer entries, or if any of your entries have a new-line in their name.
Using awk:
ls -lrS | awk 'NR==2 { print; } END { print; }'
It prints when the line number NR is 2 and again on the final line.
Note: As pointed out in the comments, $0 may or may not be available in an END block depending on your awk version.
whatever | awk 'NR==2{x=$0;next} {y=$0} END{if (x!="") print x; if (y!="") print y}'
You need that complexity (and more to be REALLY robust) to handle input that's less than 3 lines.
ls is not a reliable tool for this job: It can't represent all possible filenames (spaces are possible, but also newlines and other special characters -- all but NUL). One robust solution on a system with GNU tools is to use find:
{
# read the first size and name
IFS= read -r -d' ' first_size; IFS= read -r -d '' first_name;
# handle case where only one file exists
last_size=$first_size; last_name=$first_name
# continue reading "last" size and name, until one really is last
while IFS= read -r -d' ' curr_size && IFS= read -r -d '' curr_name; do
last_size=$curr_size; last_name=$curr_name
done
} < <(find . -mindepth 1 -maxdepth 1 -type f -printf '%s %P\0' | sort -n -z)
The above puts results into variables $first_size, $first_name, $last_size and $last_name, usable thusly:
printf 'Smallest file is %d bytes, named %q\n' "$first_size" "$first_name"
printf 'Largest file is %d bytes, named %q\n' "$last_size" "$last_name"
In terms of how it works:
find ... -printf '%s %P\0'
...emits a stream of the following form from find:
<size> <name><NUL>
Running that stream through sort -n -z does a numeric sort on its contents. IFS= read -r -d' ' first_size reads everything up to the first space; IFS= read -r -d '' first_name reads everything up to the first NUL; and then the loop continues to read and store additional size/name pairs until the last one is reached.

How to find files in UNIX which have a multiple-line pattern?

I'm trying to search all files for a pattern that spans multiple lines, and then return a list of file names that match the pattern.
I'm using this line:
find . -name "$file_to_check" 2>/dir1/null | xargs grep "$2" >> $grep_out
This creates, in $grep_out, a list of files and the lines on which the matched pattern is found. The problem with this is that the search doesn't span multiple lines. I've read that grep cannot span multiple lines, so I'm looking to replace grep with sed or awk.
The only thing I think that needs to be changed is the grep. I've found that grep can't search for a pattern across multiple lines, so I'm looking to use sed or awk. When I use these commands from the terminal, I get a large printout of the file matching the pattern I've given sed. All I want is the filename, not the context of the pattern. Is there a way to retrieve this - perhaps have sed print out the filename rather than the context? Or, have sed return true/false when it finds a match, and then I can save the current filename that was used to do the search.
Most text processing tools are line-oriented by default. If we choose to read records as paragraphs, using blank lines as record separators:
awk -v RS= -v pattern="$2" '$0 ~ pattern {print FILENAME; exit}' file
or
find . -options ... -print0 | xargs -0 awk -v RS= -v pattern="$2" '$0 ~ pattern {print FILENAME; exit}'
I'm assuming your pattern does not contain consecutive newlines (i.e. blank lines)
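A minimal illustration of the paragraph-record idea, using a hypothetical throwaway file and GNU awk (where "." in the pattern also matches the newlines inside a record):
printf 'alpha\nbeta\n\ngamma\n' > sample.txt    # "alpha" and "beta" sit in the same paragraph
awk -v RS= -v pattern='alpha.*beta' '$0 ~ pattern {print FILENAME; exit}' sample.txt
# prints: sample.txt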
To check if a file contains "word1[anything]word2[anything]word3"
1. Brute force: read the entire file and then do a regex comparison, with bash:
contents=$(< "$file")
if [[ $contents =~ "$word1".*"$word2".*"$word3" ]]; then
echo "match"
else
echo "no match"
fi
2. Line by line with awk, using a state machine:
awk -v w1="$word1" -v w2="$word2" -v w3="$word3" '
$0 ~ w1 {have_w1 = 1}
have_w1 && $0 ~ w2 {have_w2 = 1}
have_w2 && $0 ~ w3 {have_w3 = 1; exit}
END {exit (! have_w3)}
' filename
Ah, strike #2: that would match the line "word3word2word1" -- it does not enforce the order of the words.
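A sketch (not from the original answer) of one way to actually enforce the order: advance a state variable only when the words are met in sequence, and trim off the already-matched text so a later word cannot match earlier in the line:
awk -v w1="$word1" -v w2="$word2" -v w3="$word3" '
{
    line = $0
    if (state == 0 && (p = match(line, w1))) { state = 1; line = substr(line, p + RLENGTH) }
    if (state == 1 && (p = match(line, w2))) { state = 2; line = substr(line, p + RLENGTH) }
    if (state == 2 && match(line, w3))       { state = 3; exit }
}
END {exit (state != 3)}
' filename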
I'm trying to search all files for a pattern that spans multiple lines, and then return a list of file names that match the pattern.
pattern=$( echo "whatever your search pattern is" | tr '\n' ' ' )
for FILE in *
do
tr '\n' ' ' <"$FILE" | if grep "$pattern" then; echo $FILE; fi
done
Just replace the newlines with spaces, in both your pattern and your grep input.
With find, you could do it like this:
#!/bin/bash
find . -name "$file_to_check" 2>/dir1/null | while read FILE
do
tr '\n' ' ' <"$FILE" | if grep -q "word1.*word2.*word3" ; then echo "$FILE" ; fi
done >grep_out
As for the search pattern: ".*" means "any amount of any character"
Remember that a search pattern in grep needs certain characters escaped: a literal "." becomes "\." and a literal "^" becomes "\^".
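For example (a hypothetical pattern, just to show the escaping):
grep 'v1\.2' file.txt     # "\." matches a literal dot, so this matches v1.2 but not v1x2
grep -F 'v1.2' file.txt   # or use -F to treat the whole pattern as a fixed string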

Bash Shell awk/xargs magic

I'm trying to learn a little awk foo. I have a CSV where each line is of the format partial_file_name,file_path. My goal is to find the files (based on the partial name) and move them to their respective new paths. I wanted to combine the forces of find, awk and mv to achieve this, but I'm stuck on the implementation. I wanted to use awk to separate the terms from the csv file so that I could do something like
find . -name '*$1*' -print | xargs mv {} $2{}
where $1 and $2 are the split terms from the csv file. Anyone have any ideas? -peace
I suggest doing this:
$ cat korv
foo.txt,/hello/
bar.jpg,/mullo/
$ awk -F, '{print $1 " " $2}' korv
foo.txt /hello/
bar.jpg /mullo/
-F sets the delimiter, so the above will split using ",". Next, add * to the filenames:
$ awk -F, '{print "*"$1"*" " " $2}' korv
*foo.txt* /hello/
*bar.jpg* /mullo/
**
This shows I have an empty line. We don't want this match, so we add a rule:
$ awk -F, '/[a-z]/{print "*"$1"*" " " $2}' korv
*foo.txt* /hello/
*bar.jpg* /mullo/
Looks good, so feed all of this to mv using command substitution:
$ mv $(awk -F, '/[a-z]/{print "*"$1"*" " " $2}' korv)
$
Done.
You don't really need awk for this. There isn't really anything here which awk does better than the shell.
#!/bin/sh
IFS=,
while read file target; do
find . -name "$file" -print0 | xargs -ir0 mv {} "$target"
done <path_to_csv_file
If you have special characters in the file names, you may need to tweak the read.
What about using awk's system command:
awk '{ system("find . -name " $1 " -print | xargs -I {} mv {} " $2 "{}"); }'
Example line in the csv file: test.txt ./subdirectory/ (note that, as written, this awk splits on whitespace, not commas).
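If the input really is comma-separated, as described in the question, a sketch of the same idea with -F, (files.csv is a hypothetical name):
awk -F, '{
    # $1 = partial file name, $2 = destination directory
    cmd = "find . -name \"*" $1 "*\" -print0 | xargs -0 -I {} mv {} \"" $2 "\""
    system(cmd)
}' files.csv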
find . -name "*err" -size "+10c" | awk -F.err '{print $1".job"}' | xargs -I {} qsub {}
