grep for multiple strings in a file on different lines (i.e. whole-file, not line-based search)? - bash

I want to grep for files containing the words Dansk, Svenska and Norsk on any line, with a usable return code (I really only need to know whether the strings are contained; my actual one-liner goes a little further than this).
I have many files with lines in them like this:
Disc Title: unknown
Title: 01, Length: 01:33:37.000 Chapters: 33, Cells: 31, Audio streams: 04, Subpictures: 20
Subtitle: 01, Language: ar - Arabic, Content: Undefined, Stream id: 0x20,
Subtitle: 02, Language: bg - Bulgarian, Content: Undefined, Stream id: 0x21,
Subtitle: 03, Language: cs - Czech, Content: Undefined, Stream id: 0x22,
Subtitle: 04, Language: da - Dansk, Content: Undefined, Stream id: 0x23,
Subtitle: 05, Language: de - Deutsch, Content: Undefined, Stream id: 0x24,
(...)
Here is the pseudocode of what I want:
for all files in directory;
    if file contains "Dansk" AND "Norsk" AND "Svenska" then
        echo the filename
    end
What is the best way to do this? Can it be done on one line?

You can use:
grep -l Dansk * | xargs grep -l Norsk | xargs grep -l Svenska
If you also want to search hidden files:
grep -l Dansk .* | xargs grep -l Norsk | xargs grep -l Svenska

Yet another way using just bash and grep:
For a single file 'test.txt':
grep -q Dansk test.txt && grep -q Norsk test.txt && grep -l Svenska test.txt
This will print test.txt if and only if the file contains all three words (in any order). The first two greps don't print anything (-q) and the last one only prints the filename if the other two have succeeded.
If you want to do it for every file in the directory:
for f in *; do grep -q Dansk "$f" && grep -q Norsk "$f" && grep -l Svenska "$f"; done
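If you want the same &&-chain for an arbitrary word list, a small helper keeps it readable (a sketch, not from the original answer; contains_all is a made-up name):
contains_all() {
    # usage: contains_all FILE WORD...
    # succeed only if FILE contains every WORD somewhere
    local file=$1; shift
    for word; do
        grep -q -- "$word" "$file" || return 1
    done
}
for f in *; do
    contains_all "$f" Dansk Norsk Svenska && printf '%s\n' "$f"
done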

grep -irl word1 * | grep -il word2 `cat -` | grep -il word3 `cat -`
-i makes the search case-insensitive
-r makes the file search recursive through folders
-l lists the files in which the word was found
`cat -` makes the next grep search only the list of files passed to it on stdin (the output of the previous grep).

You can do this really easily with ack:
ack -l 'cats' | ack -xl 'dogs'
-l: return a list of files
-x: take the files from STDIN (the previous search) and only search those files
And you can just keep piping until you get just the files you want.

To grep for multiple strings in a file on different lines, use the pipe symbol (alternation):
for file in *; do
    test $(grep -E 'Dansk|Norsk|Svenska' "$file" | wc -l) -ge 3 && echo "$file"
done
Notes:
If you use plain grep (without -E), you will have to escape the pipe like this: \| to alternate between Dansk, Norsk and Svenska.
This counts matching lines, so it assumes each language appears on its own line; a file that mentions a single language on three different lines would also pass.
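A hedged variant that sidesteps that assumption is to count distinct matches rather than matching lines (a sketch; -o is a GNU grep extension):
for file in *; do
    # count how many DISTINCT languages were found anywhere in the file
    found=$(grep -oE 'Dansk|Norsk|Svenska' "$file" 2>/dev/null | sort -u | wc -l)
    [ "$found" -eq 3 ] && echo "$file"
done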
Walkthrough: http://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/

awk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print "0" }'
You can then test the printed value with the shell (the one-liner prints 0 only when all three words were seen).
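For example, a usage sketch that tests the printed value from the shell (dvdfile.dat is just a placeholder file name):
out=$(awk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print "0" }' dvdfile.dat)
if [ "$out" = "0" ]; then
    echo "dvdfile.dat contains all three words"
fi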
If you have Ruby (1.9+):
ruby -0777 -ne 'print if /Dansk/ and /Norsk/ and /Svenska/' file

This searches for multiple words in multiple files (note that it lists lines containing any of the words, not only files containing all of them):
egrep 'abc|xyz' file1 file2 ..filen

Simply:
grep 'word1\|word2\|word3' *

This is a blending of glenn jackman's and kurumi's answers which allows an arbitrary number of regexes instead of an arbitrary number of fixed words or a fixed set of regexes.
#!/usr/bin/awk -f
# by Dennis Williamson - 2011-01-25
BEGIN {
    for (i=ARGC-2; i>=1; i--) {
        patterns[ARGV[i]] = 0;
        delete ARGV[i];
    }
}
{
    for (p in patterns)
        if ($0 ~ p)
            matches[p] = 1
    # print    # the matching line could be printed
}
END {
    for (p in patterns) {
        if (matches[p] != 1)
            exit 1
    }
}
Run it like this:
./multigrep.awk Dansk Norsk Svenska 'Language: .. - A.*c' dvdfile.dat
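Since the script exits with a non-zero status when any pattern is missing, the result can be tested directly in the shell (a usage sketch):
if ./multigrep.awk Dansk Norsk Svenska dvdfile.dat; then
    echo "dvdfile.dat matches all the patterns"
else
    echo "at least one pattern is missing"
fi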

Here's what worked well for me:
find . -path '*/.svn' -prune -o -type f -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh
./another/path/to/file2.txt
./blah/foo.php
If I just wanted to find .sh files with these three, then I could have used:
find . -path '*/.svn' -prune -o -type f -name "*.sh" -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh

Expanding on @kurumi's awk answer, here's a bash function:
all_word_search() {
    gawk '
    BEGIN {
        for (i=ARGC-2; i>=1; i--) {
            search_terms[ARGV[i]] = 0;
            ARGV[i] = ARGV[i+1];
            delete ARGV[i+1];
        }
    }
    {
        for (i=1; i<=NF; i++)
            if ($i in search_terms)
                search_terms[$i] = 1
    }
    END {
        for (word in search_terms)
            if (search_terms[word] == 0)
                exit 1
    }
    ' "$@"
    return $?
}
Usage:
if all_word_search Dansk Norsk Svenska filename; then
echo "all words found"
else
echo "not all words found"
fi

I did it in two steps: first make a list of the csv files in one file, then grep that list. With the help of the comments on this page I got what I needed without writing a script. Just type into a terminal:
$ find /csv/file/dir -name '*.csv' > csv_list.txt
$ grep -q Svenska `cat csv_list.txt` && grep -q Norsk `cat csv_list.txt` && grep -l Dansk `cat csv_list.txt`
It did exactly what I needed: it prints the names of the files containing all three words.
Also mind the symbols: backticks ` versus single ' and double " quotes.

If you only need two search terms, arguably the most readable approach is to run each search and intersect the results:
comm -12 <(grep -rl word1 . | sort) <(grep -rl word2 . | sort)
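For three terms the same idea nests; each comm -12 keeps only the names present in both of its (sorted) inputs (a sketch):
comm -12 \
    <(comm -12 <(grep -rl word1 . | sort) <(grep -rl word2 . | sort)) \
    <(grep -rl word3 . | sort)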

If you have git installed:
git grep -l --all-match --no-index -e Dansk -e Norsk -e Svenska
The --no-index option searches files in the current directory that are not managed by Git, so this command will work in any directory, regardless of whether it is a git repository.
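git grep also supports -q, so if you just want an exit status for a single file you can test it directly (a sketch; assumes a git version with --all-match and -q, and dvdfile.dat is a placeholder name):
if git grep -q --all-match --no-index -e Dansk -e Norsk -e Svenska -- dvdfile.dat; then
    echo "dvdfile.dat contains all three words"
fi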

I had this problem today, and all the one-liners here failed for me because the file names contained spaces.
This is what I came up with that worked:
grep -ril <WORD1> | sed 's/.*/"&"/' | xargs grep -il <WORD2>
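With GNU grep and xargs you can avoid the quoting step entirely by passing NUL-delimited names, which also survives file names containing quotes or newlines (a sketch; -Z and -0 are GNU extensions):
grep -rilZ <WORD1> . | xargs -0 grep -il <WORD2>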

A simple bash one-liner for an arbitrary list LIST and a file my_file.txt:
LIST="Dansk Norsk Svenska"
EVAL=$(echo "$LIST" | sed 's/[^ ]* */grep -q & my_file.txt \&\& /g'); eval "$EVAL echo yes || echo no"
Replacing eval with echo reveals that the following command is evaluated:
grep -q Dansk my_file.txt && grep -q Norsk my_file.txt && grep -q Svenska my_file.txt && echo yes || echo no
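If the eval feels risky, a plain loop over the same LIST gives the same yes/no answer (a sketch, not part of the original answer):
ok=yes
for w in $LIST; do
    grep -q -- "$w" my_file.txt || { ok=no; break; }
done
echo "$ok"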

Related

Remove half the lines of files recursively using sed and wc

I wonder if there is a way to remove half the lines of a file using wc and sed.
I can do this:
sed -i '50,$d' myfile.txt
Which removes lines from 50 to the end of file. I can also do this:
wc -l myfile.txt
Which returns the number of lines in the file.
But what I really want to do is something like this:
wc -l myfile.txt | sed -i '{wc -l result}/2,$d' myfile.txt
How can I tell sed to remove the lines starting from the wc -l result divided by 2?
How can I do this recursively?
You can make use of the head command:
head -n -"$(($(wc -l<file)/2))" file
awk is also possible, and with an exit statement it could be faster:
awk -v t="$(wc -l <file)" 'NR>=t/2{exit}7' file
You can write the result to a new file with awk/head ... > newFile, or use cmd > tmp && mv tmp file to get an "in-place" change.
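For example, the head variant applied in place to a single file could look like this (a sketch; GNU head is assumed for the negative line count):
head -n -"$(( $(wc -l < myfile.txt) / 2 ))" myfile.txt > tmp && mv tmp myfile.txt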
I guess you were close. Just use a command substitution with arithmetic expansion to get the value of the starting line:
startline="$(( $(wc -l <myfile.txt) / 2 ))"
sed -i "$startline"',$d' myfile.txt
Or a oneliner:
sed -i "$(( $(wc -l <myfile.txt) / 2 ))"',$d' myfile.txt
In some sense, reading the file twice (or, as noted by Kent, once and a half) is unavoidable. Perhaps it will be slightly more efficient if you use just a single process.
awk 'NR==FNR { total=FNR; next }
FNR>total/2 { exit } 1' myfile.txt myfile.txt
Doing this recursively with Awk is slightly painful. If you have GNU Awk, you can use the -i inplace option, but I'm not sure of its semantics when you read the same file twice. Perhaps just fall back to a temporary output file.
find . -type f -exec sh -c "awk 'NR==FNR { total=FNR; next }
FNR>total/2 { exit } 1' {} {} >tmp &&
mv tmp {}" _ \;

Count the number of files in a directory containing two specific string in bash

I have a few files in a directory containing the pattern below:
Simulator tool completed simulation at 20:07:18 on 09/28/18.
The situation of the simulation: STATUS PASSED
Now I want to count the number of files which contain both of the strings completed simulation and STATUS PASSED anywhere in the file.
This command works to search for the single string STATUS PASSED and count the matching files:
find /directory_path/*.txt -type f -exec grep -l "STATUS PASSED" {} + | wc -l
sed also gives 0 as a result:
find /directory_path/*.txt -type f -exec sed -e '/STATUS PASSED/!d' -e '/completed simulation/!d' {} + | wc -l
Any help/suggestion will be much appreciated!
find . -type f -exec \
awk '/completed simulation/{x=1} /STATUS PASSED/{y=1} END{if (x&&y) print FILENAME}' {} \; |
wc -l
I'm printing the matching file names in case that's useful in some other context, but piping that to wc will fail if the file names contain newlines; if that's a concern, just print 1 (or anything else) from awk instead.
Since find /directory_path/*.txt -type f is the same as just ls /directory_path/*.txt if all of the ".txt"s are files, though, it sounds like all you actually need is (using GNU awk for nextfile):
awk '
FNR==1 { x=y=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y { cnt++; nextfile }
END { print cnt+0 }
' /directory_path/*.txt
or with any awk:
awk '
FNR==1 { x=y=f=0 }
/completed simulation/ { x=1 }
/STATUS PASSED/ { y=1 }
x && y && !f { cnt++; f=1 }
END { print cnt+0 }
' /directory_path/*.txt
Those will work no matter what characters are in your file names.
Using grep and standard utils:
{ grep -Hm1 'completed simulation' /directory_path/*.txt;
grep -Hm1 'STATUS PASSED' /directory_path/*.txt ; } |
sort | uniq -d | wc -l
grep -m1 stops when it finds the first match, which saves time on big files. If the list of matches is large, sort -t: -k1 (sorting on the file name field only) would be better than plain sort.
The command find /directory_path/*.txt just lists all the txt files in /directory_path/, not including subdirectories of /directory_path.
find . -name \*.txt -print0 |
while read -d $'\0' file; do
    grep -Fq 'completed simulation' "$file" &&
    grep -Fq 'STATUS PASSED' "$_" &&
    echo "$_"
done |
wc -l
If you can ensure there are no special characters in the filenames:
find . -name \*.txt |
while read file; do
    grep -Fq 'completed simulation' "$file" &&
    grep -Fq 'STATUS PASSED' "$file" &&
    echo "$file"
done |
wc -l
I don't have AIX to test it, but it should be POSIX compliant.

Append wc lines to filename

Title says it all. I've managed to get just the lines with this:
lines=$(wc file.txt | awk {'print $1'});
But I could use an assist appending this to the filename. Bonus points for showing me how to loop this over all the .txt files in the current directory.
find -name '*.txt' -execdir bash -c \
'mv -v "$0" "${0%.txt}_$(wc -l < "$0").txt"' {} \;
where
the bash command is executed for each (\;) matched file;
{} is replaced by the currently processed filename and passed as the first argument ($0) to the script;
${0%.txt} deletes shortest match of .txt from back of the string (see the official Bash-scripting guide);
wc -l < "$0" prints only the number of lines in the file (see answers to this question, for example)
Sample output:
'./file-a.txt' -> 'file-a_5.txt'
'./file with spaces.txt' -> 'file with spaces_8.txt'
You could use the rename command, which is actually a Perl script, as follows:
rename --dry-run 'my $fn=$_; open my $fh,"<$_"; while(<$fh>){}; $_=$fn; s/.txt$/-$..txt/' *txt
Sample Output
'tight_layout1.txt' would be renamed to 'tight_layout1-519.txt'
'tight_layout2.txt' would be renamed to 'tight_layout2-1122.txt'
'tight_layout3.txt' would be renamed to 'tight_layout3-921.txt'
'tight_layout4.txt' would be renamed to 'tight_layout4-1122.txt'
If you like what it says, remove the --dry-run and run again.
The script counts the lines in the file without using any external processes and then renames the files as you ask, also without using any external processes, so it is quite efficient.
Or, if you are happy to invoke an external process to count the lines, and avoid the Perl method above:
rename --dry-run 's/\.txt$/-`grep -ch "^" "$_"` . ".txt"/e' *txt
Use the rename command:
for file in *.txt; do
    lines=$(wc ${file} | awk {'print $1'});
    rename s/$/${lines}/ ${file}
done
#!/bin/bash
files=$(find . -maxdepth 1 -type f -name '*.txt' -printf '%f\n')
for file in $files; do
    lines=$(wc $file | awk {'print $1'});
    extension="${file##*.}"
    filename="${file%.*}"
    mv "$file" "${filename}${lines}.${extension}"
done
You can adjust maxdepth accordingly.
You can also do it like this:
for file in "path_to_file"/'your_filename_pattern'
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
example:
for file in /oradata/SCRIPTS_EL/text*
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
This would work, but there are definitely more elegant ways.
for i in *.txt; do
    mv "$i" ${i/.txt/}_$(wc $i | awk {'print $1'})_.txt;
done
The result puts the line count nicely before the .txt.
Like:
file1_1_.txt
file2_25_.txt
You could use grep -c '^' to get the number of lines, instead of wc and awk:
for file in *.txt; do
    [[ ! -f $file ]] && continue  # skip over entries that are not regular files
    #
    # move file.txt to file.txt.N where N is the number of lines in file
    #
    # this naming convention has the advantage that if we run the loop again,
    # we will not reprocess the files which were processed earlier
    mv "$file" "$file".$(grep -c '^' "$file")
done
{ linecount[FILENAME] = FNR }
END {
    linecount[FILENAME] = FNR
    for (file in linecount) {
        newname = gensub(/\.[^\.]*$/, "-"linecount[file]"&", 1, file)
        q = "'"; qq = "'\"'\"'"; gsub(q, qq, newname)
        print "mv -i -v '" gensub(q, qq, "g", file) "' '" newname "'"
    }
}
Save the above awk script in a file, say wcmv.awk, then run it like:
awk -f wcmv.awk *.txt
It will list the commands that need to be run to rename the files in the required way (except that it will ignore empty files). To actually execute them, pipe the output to a shell:
awk -f wcmv.awk *.txt | sh
Like it goes with all irreversible batch operations, be careful and execute commands only if they look okay.
awk '
BEGIN { for ( i=1; i<ARGC; i++ ) Files[ARGV[i]]=0 }
{ Files[FILENAME]++ }
END {
    for (file in Files) {
        # if ( file !~ "_" Files[file] ".txt$") {
            fileF=file; gsub( /\047/, "\047\"\047\"\047", fileF)
            fileT=fileF; sub( /.txt$/, "_" Files[file] ".txt", fileT)
            system( sprintf( "mv \047%s\047 \047%s\047", fileF, fileT))
        # }
    }
}' *.txt
Another way with awk, which makes a second run easier to manage by allowing more control over the name (for example, skipping files that already have the count in their name from a previous run).
Following the good remarks of @gniourf_gniourf:
file names with spaces inside are now handled
the once-tiny code is now rather heavy for such a small task

How to print the number of occurrences of a word in a file in unix

This is my shell script.
Given a directory and a word, search the directory and print the absolute path of the file that has the maximum number of occurrences of the word, along with that count.
I have written the following script:
#!/bin/bash
if [[ -n $(find / -type d -name $1 2> /dev/null) ]]
then
    echo "Directory exists"
    x=`echo " $(find / -type d -name $1 2> /dev/null)"`
    echo "$x"
    cd $x
    y=$(find . -type f | xargs grep -c $2 | grep -v ":0" | grep -o '[^/]*$' | sort -t: -k2,1 -n -r )
    echo "$y"
else
    echo "Directory does not exist"
fi
Invocation: scriptname directoryname word
output: /somedirectory/vtb/wordsearch : 4
/foo/bar: 3
Is there any option to replace xargs grep -c $2? grep -c prints the number of lines which contain the word, but I need to print the exact number of occurrences of the word in the files in the given directory.
Using grep's -c count feature:
grep -c "SEARCH" /path/to/files* | sort -r -t : -k 2 | head -n 1
The grep command will output each file in a /path/name:count format, the sort will numerically (-n) sort by the 2nd (-k 2) field as delimited by a colon (-t :) in reverse order (-r). We then use head to keep the first result (-n 1).
Try This:
grep -o -w 'foo' bar.txt | wc -w
OR
grep -o -w 'word' /path/to/file/ | wc -w
grep -Fwor "$word" "$dir" | sed "s/:${word}\$//" | sort | uniq -c | sort -n | tail -1
This lists one line per occurrence of the word (as filename:word), strips the word, counts the occurrences per file, and keeps the file with the highest count.

Grep multiple occurrences given two strings and two integers

I'm looking for a bash script to count the occurrences of a word in the files of a given directory and its subdirectories, using this pattern:
^str1{n}str2{m}$
for example:
str1= yo
str2= uf
n= 3
m= 4
the match would be "yoyoyoufufufuf"
but I'm having trouble with grep.
This is what I have tried:
for file in $(find $dir)
do
if [ -f $file ]; then
echo "<$file>:<`grep '\<\$str1\{$n\}\$str2\{$m\}\>'' $file | wc -l >" >> a.txt
fi
done
Should I use find?
@Barmar's comment is useful.
If I understand your question, I think this single grep command should do what you're looking for:
grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir"
Note that the combination of -r and -c causes grep to output zero counts for non-matching files. You can pipe to grep -v ":0$" to suppress that output if you need to:
$ dir=.
$ str1=yo
$ str2=uf
$ n=3
$ m=4
$ cat youf
yoyoyoufufufuf
$ grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir"
./noyouf:0
./youf:1
./dir/youf:1
$ grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir" | grep -v ":0$"
./youf:1
./dir/youf:1
$
Note also that $str1 and $str2 need to be put in parentheses so that {n} and {m} apply to the whole string within the parentheses and not just to its last character.
Note the backslash-escaping of the () and {}: grep's basic regular expressions require it, and double quotes are used (rather than single quotes) so that the shell expands the variables into the regular expression.
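If the escaping feels noisy, grep's extended regular expressions accept the same pattern without backslashes (a sketch using the same variables; the behaviour should be equivalent):
grep -r -E -c "^($str1){$n}($str2){$m}$" "$dir"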
