Using grep and pipes in Unix to find specific words - bash

Let's say I'm using grep, and I use the -v option on a text file to find all the words that do not contain vowels. If I then wanted to see how many words there are in this file that do not contain vowels, what could I do?
I was thinking of using a pipe and using the rc command by itself. Would that work? Thanks.

Actually, I believe that you want wc, not rc. In this case, though, grep's -c option can do the counting for you:
grep -civ '[aeiouy]' words.txt
For example, consider the file:
$ cat words.txt
the
words
mph
tsk
hmmm
Then, the following correctly counts the three "words" without vowels:
$ grep -civ '[aeiouy]' words.txt
3
I included y in the vowel list. You can decide whether or not it should be removed.
Also, I assumed above that your file has one word per line.
The grep options used above are as follows:
-v means exclude matching lines
-i makes the matching case-insensitive
-c tells grep to return a count, not the actual matches
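For reference, running the same command without -c shows the lines that are being counted:
$ grep -iv '[aeiouy]' words.txt
mph
tsk
hmmm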
Multiple words per line
$ echo the tsk hmmm | grep -io '\b[bcdfghjklmnpqrstvxz]*\b' | wc -l
2
Because \b matches at word boundaries, the above regex matches only words that lack vowels. -o tells grep to print only the matching portion of the line, not the entire line. Because -c counts the number of matching lines rather than the number of matches, it is not useful here; wc -l is used instead to count the matches.
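To illustrate the difference, -c on the same input reports only 1, because it counts matching lines rather than individual matches:
$ echo the tsk hmmm | grep -ic '\b[bcdfghjklmnpqrstvxz]*\b'
1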

The following script will count the number of words that don't contain vowels (if there are several words per line):
#!/bin/bash
# File can be a script parameter
FILE="$1"
let count=0
while read line; do
    for word in $line; do
        grep -qv "[aeiou]" <<< "$word"
        if [ $? -eq 0 ]; then
            let count++
        fi
    done
done < "$FILE"
echo "words without vowels: $count"
If there is only one word per line, then the following will be enough:
grep -cv "[aeiou]" < file

If multiple words can be on the same line, and you want to count them too, you can use grep -o with wc -l to count all the matches correctly, like so:
$ echo "word work no-match wonder" | grep -o "wo[a-z]*" | wc -l
3

You could, alternatively, do this all in awk:
awk '!/[aeiou]/ {n++} END {print n}' file
For lines with multiple fields:
awk '{for(i=1; i<=NF; i++) if($i !~ /[aeiou]/) n++} END {print n}' file
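One small caveat for both one-liners: if no vowel-free words are found, n is never set and print n emits an empty line. Forcing a numeric context prints 0 instead:
awk '!/[aeiou]/ {n++} END {print n+0}' file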

Related

remove duplicate lines with similar prefix

I need to remove similar lines in a file which has duplicate prefix and keep the unique ones.
From this,
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/
123/456/789/
xyz/
to this
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
Appreciate any suggestions,
An answer, in case reordering the output is allowed:
sort -r file | awk 'a!~"^"$0{a=$0;print}'
sort -r file : sort lines in reverse order; this way, longer lines sharing a prefix are placed before shorter lines with the same prefix
awk 'a!~"^"$0{a=$0;print}' : parse the sorted output, where a holds the most recently printed line and $0 holds the current line
a!~"^"$0 checks, for each line, whether the current line is not a prefix of a
if $0 is not such a prefix (i.e. not a shared prefix), we print it and store it in a (to be compared with the following lines)
For the first line, a is empty because no value has been assigned to it yet, so the first line is always printed
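Assuming the sample data from the question is stored in file, the whole pipeline gives (note the changed order):
$ sort -r file | awk 'a!~"^"$0{a=$0;print}'
xyz/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/
123/456/789/
One caveat: since $0 is used as part of a regular expression, lines containing regex metacharacters may not be compared literally.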
A quick and dirty way of doing it is the following:
$ while read elem; do echo -n "$elem " ; grep $elem file| wc -l; done <file | awk '$2==1{print $1}'
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
where you read the input file and print each element together with the number of times it appears in the file; then, with awk, you print only the lines where it appears exactly once.
Step 1: This solution is based on the assumption that reordering the output is allowed. If so, it should be faster to reverse-sort the input file before processing: after the reverse sort we only need to compare two consecutive lines in each loop iteration, with no need to search the whole file or all of the known prefixes. I understand that a line counts as a prefix, and should be removed, if it is a prefix of any other line. Here is an example of removing prefixes from a file when reordering is allowed:
#!/bin/bash
f=sample.txt   # sample data
p=''           # previous line = empty
sort -r "$f" | \
while IFS= read -r s || [[ -n "$s" ]]; do   # reverse sort, then read each string (line)
    [[ "$s" = "${p:0:${#s}}" ]] || \
        printf "%s\n" "$s"                  # if s is not a prefix of p, then print it
    p="$s"
done
Explanation: ${p:0:${#s}} takes the first ${#s} (length of s) characters of the string p.
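A quick illustration of that expansion, with made-up values:
s='abc/def/'                   # ${#s} is 8
p='abc/def/ghi/jkl/one/'
echo "${p:0:${#s}}"            # prints: abc/def/
[[ "$s" = "${p:0:${#s}}" ]] && echo "s is a prefix of p"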
Test:
$ cat sample.txt
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/
123/456/789/
xyz/
$ ./remove-prefix.sh
xyz/
abc/def/ghi/jkl/two/two
abc/def/ghi/jkl/one/one
123/456/789/
Step 2: If you really need to keep the order, this script is an example of removing all prefixes without reordering the output:
#!/bin/bash
f=sample.txt
p=''
cat -n "$f" | \
sed 's:\t:|:' | \
sort -r -t'|' -k2 | \
while IFS='|' read -r i s || [[ -n "$s" ]]; do
    [[ "$s" = "${p:0:${#s}}" ]] || printf "%s|%s\n" "$i" "$s"
    p="$s"
done | \
sort -n -t'|' -k1 | \
sed 's:^.*|::'
Explanations:
cat -n: numbering all lines
sed 's:\t:|:': use '|' as the delimiter -- you need to change it to another one if needed
sort -r -t'|' -k2: reverse sort with delimiter='|' and use the key 2
while ... done: similar to solution of step 1
sort -n -t'|' -k1: sort back to original order (numbering sort)
sed 's:^.*|::': remove the numbering
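For reference, the numbered intermediate format that the while loop reads looks roughly like this (first two lines of sample.txt, before sorting):
$ cat -n sample.txt | sed 's:\t:|:' | head -2
     1|abc/def/ghi/
     2|abc/def/ghi/jkl/one/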
Test:
$ ./remove-prefix.sh
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/789/
xyz/
Notes: In both solutions, the most costly operations are the calls to sort. The solution in step 1 calls sort once, and the solution in step 2 calls sort twice. All the other operations (cat, sed, while, string comparisons, ...) are not at the same level of cost.
In the solution of step 2, cat + sed + while + sed is "equivalent" to scanning the file 4 times (which theoretically can be executed in parallel thanks to the pipes).
The following awk does what is requested; it reads the file twice.
In the first pass it builds up all possible prefixes per line.
In the second pass, it checks whether the line is a known prefix; if not, it prints the line.
The code is:
awk -F'/' '(NR==FNR){s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]};next}
{if (! ($0 in a) ) {print $0}}' <file> <file>
You can also do it with reading the file a single time, but then you store it into memory :
awk -F'/' '{s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]}; b[NR]=$0; next}
END {for(i=1;i<=NR;i++){if (! (b[i] in a) ) {print b[i]}}}' <file>
Similar to the solution of Allan, but using grep -c:
while read line; do (( $(grep -c $line <file>) == 1 )) && echo $line; done < <file>
Take into account that this construct reads the file (N+1) times, where N is the number of lines.

Read each line of a column of a file and execute grep

I have file.txt exemplary here:
This line contains ABC
This line contains DEF
This line contains GHI
and here the following list.txt:
contains ABC<TAB>ABC
contains DEF<TAB>DEF
Now I am writing a script that executes the following commands for each line of this external file list.txt:
take the string from column 1 of list.txt and search in a third file file.txt
if the first command is positive, return the string from column 2 of list.txt
So my output.txt is:
ABC
DEF
This is my code for grep/echo with putting the query/return strings manually:
if grep -i -q 'contains abc' file.txt
then
echo ABC >output.txt
else
echo -n
fi
if grep -i -q 'contains def' file.txt
then
echo DEF >>output.txt
else
echo -n
fi
I have about 100 search terms, which makes the task laborious if done manually. So how do I include while read line; do [commands]; done<list.txt together with the commands about column1 and column2 inside that script?
I would like to use simple grep/echo/awk commands if possible.
Something like this?
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if (index($0, x)) {print a[x]}} ' list.txt file.txt
ABC
DEF
For the lines of the first file (FNR==NR), read the key-value pairs into array a. Then for the lines of the second file, loop through the array, check if the key is found on the line, and if so, print the stored value. index($0, x) tries to find the contents of x in (the current line) $0; $0 ~ x would instead take x as a regex to match against.
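The difference matters if a key contains regex metacharacters. With a hypothetical key a.c:
$ echo 'abc' | awk '{ print index($0, "a.c"), ($0 ~ "a.c") }'
0 1
index() looks for the literal characters a.c and finds nothing (0), while ~ treats the dot as "any character" and reports a match (1).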
If you want to do it in the shell, starting a separate grep for each and every line of list.txt, something like this:
while IFS=$'\t' read k v ; do
grep -qFe "$k" file.txt && echo "$v"
done < list.txt
read k v reads a line of input and splits it (based on IFS) into k and v.
grep -F takes the pattern as a fixed string, not a regex, and -q prevents it from outputting the matching line. grep returns true if any matching lines are found, so $v is printed if $k is found in file.txt.
Using awk and grep:
for text in `awk '{print $4}' file.txt`
do
    grep "contains $text" list.txt | awk -F $'\t' '{print $2}'
done

Using grep to get the line number of first occurrence of a string in a file

I am using a bash script for testing purposes. During my testing I have to find the line number of the first occurrence of a string in a file. I have tried both "awk" and "grep", but neither of them returns the value.
Awk example
#/!bin/bash
....
VAR=searchstring
...
cpLines=$(awk '/$VAR/{print NR}' $MYDIR/Configuration.xml)
this does not expand $VAR. If I use the value of VAR it works, but I want to use VAR
Grep example
#/!bin/bash
...
VAR=searchstring
...
cpLines=grep -n -m 1 $VAR $MYDIR/Configuration.xml |cut -f1 -d:
this gives error line 20: -n: command not found
grep -n -m 1 SEARCH_TERM FILE_PATH |sed 's/\([0-9]*\).*/\1/'
grep switches
-n = include line number
-m 1 = stop after the first match
sed options (stream editor):
's/X/Y/' - replace X with Y
\([0-9]*\) - matches zero or more digits; the escaped parentheses form a group, and the text matched inside the parentheses becomes the \1 back-reference in the replacement string Y
\([0-9]*\).* - the trailing .* matches the rest of the line (any characters, zero or more times)
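Put together, the sed expression keeps only the leading digits of grep's NUM:line output. For example, with a made-up match on line 42:
$ echo '42:some matching text' | sed 's/\([0-9]*\).*/\1/'
42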
You need $() for command substitution, to capture grep's output in the variable:
cpLines=$(grep -n -m 1 $VAR $MYDIR/Configuration.xml |cut -f1 -d: )
Try something like:
awk -v search="$var" '$0~search{print NR; exit}' inputFile
Inside /.../, awk treats the text as a literal regular expression; it will not substitute a variable there. You need to use the match operator (~) with a variable instead. What we are doing here is matching the variable against your input line; if it matches, we print the line number stored in NR and exit.
-v allows you to create an awk variable (search in the above example). You then assign your bash variable ($var) to it.
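Applied to the script from the question, that might look like this (same VAR and Configuration.xml path as above):
VAR=searchstring
cpLines=$(awk -v search="$VAR" '$0~search{print NR; exit}' "$MYDIR/Configuration.xml")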
grep -n -m 1 SEARCH_TERM FILE_PATH | grep -Po '^[0-9]+'
explanation:
-Po = -P -o
-P use perl regex
-o only print matched string (not the whole line)
Try piping:
grep -P 'SEARCH TERM' fileName.txt | wc -l

How to find files in UNIX which have a multiple-line pattern?

I'm trying to search all files for a pattern that spans multiple lines, and then return a list of file names that match the pattern.
I'm using this line:
find . -name "$file_to_check" 2>/dir1/null | xargs grep "$2" >> $grep_out
This will create a list of files and the line the matched pattern is found on within $grep_out. The problem with this is that the search doesn't span multiple lines. I've read that grep cannot span multiple lines, so I'm looking to replace grep with sed or awk.
The only thing I think that needs to be changed is the grep. I've found that grep can't search for a pattern across multiple lines, so I'm looking to use sed or awk. When I use these commands from the terminal, I get a large printout of the file matching the pattern I've given sed. All I want is the filename, not the context of the pattern. Is there a way to retrieve this - perhaps have sed print out the filename rather than the context? Or, have sed return true/false when it finds a match, and then I can save the current filename that was used to do the search.
Most text processing tools are line-oriented by default. If we choose to read records as paragraphs, using blank lines as record separators:
awk -v RS= -v pattern="$2" '$0 ~ pattern {print FILENAME; exit}' file
or
find . -options ... -print0 | xargs -0 awk -v RS= -v pattern="$2" '$0 ~ pattern {print FILENAME; exit}'
I'm assuming your pattern does not contain consecutive newlines (i.e. blank lines)
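A minimal sketch of the paragraph-mode idea, using a throwaway file (with GNU awk, . also matches the newline embedded in a multi-line record):
$ printf 'first line\nsecond line\n\nanother block\n' > para.txt
$ awk -v RS= -v pattern='first.*second' '$0 ~ pattern {print FILENAME; exit}' para.txt
para.txt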
To check if a file contains "word1[anything]word2[anything]word3"
1. Brute force: read the entire file and then do a regex comparison, with bash:
contents=$(< "$file")
if [[ $contents =~ "$word1".*"$word2".*"$word3" ]]; then
    echo "match"
else
    echo "no match"
fi
2. Line by line with awk, using a state machine:
awk -v w1="$word1" -v w2="$word2" -v w3="$word3" '
$0 ~ w1 {have_w1 = 1}
have_w1 && $0 ~ w2 {have_w2 = 1}
have_w2 && $0 ~ w3 {have_w3 = 1; exit}
END {exit (! have_w3)}
' filename
Ah, strike #2: that would match the line "word3word2word1" -- does not enforce order of the words
I'm trying to search all files for a pattern that spans multiple lines, and then return a list of file names that match the pattern.
pattern=$( echo "whatever your search pattern is" | tr '\n' ' ' )
for FILE in *
do
    tr '\n' ' ' <"$FILE" | if grep -q "$pattern"; then echo "$FILE"; fi
done
Just replace the newlines with spaces in both your pattern and your grep input.
With 'find', you could do it like this:
#!/bin/bash
find . -name "$file_to_check" 2>/dir1/null | while read FILE
do
    tr '\n' ' ' <"$FILE" | if grep -q "word1.*word2.*word3" ; then echo "$FILE" ; fi
done >grep_out
As for the search pattern: ".*" means "any amount of any character"
Remember that a search pattern in grep requires certain characters to be escaped: "." becomes "\." and "^" becomes "\^".
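For example, with a dot in the pattern (file name is just for illustration):
grep 'v1.2' notes.txt     # the unescaped . also matches v1x2, v102, ...
grep 'v1\.2' notes.txt    # matches only the literal text v1.2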

Is it possible to do a grep with keywords stored in the array?

Is it possible to do a grep with keywords stored in the array.
Here is the possible code snippet; how can I correct it?
args=("key1" "key2" "key3")
cat file_name |while read line
echo $line | grep -q -w ${args[c]}
done
At the moment, I can search for only one keyword. I would like to search for all the keywords stored in the args array.
args=("key1" "key2" "key3")
pat=$(echo ${args[@]} | tr " " "|")
grep -Eow "$pat" file
Or with the shell
args=("key1" "key2" "key3")
while read -r line
do
for i in ${args[#]}
do
case "$line" in
*"$i"*) echo "found: $line";;
esac
done
done <"file"
You can use some bash expansion magic to prefix each element with -e and pass each element of the array as a separate pattern. This may avoid some precedence issues where your patterns may interact badly with the | operator:
$ grep ${args[@]/#/-e } file_name
The downside to this is that you cannot have any spaces in your patterns because that will split the arguments to grep. You cannot put quotes around the above expansion, otherwise you get "-e pattern" as a single argument to grep.
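One way around that, assuming GNU grep (which accepts the long form --regexp=PATTERN), is to fold the option and the pattern into a single word, so the quoted expansion keeps each pattern intact, spaces and all:
args=("key one" "key2" "key3")
grep "${args[@]/#/--regexp=}" file_name
Each array element expands to its own --regexp=... argument.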
This is one way:
args=("key1" "key2" "key3")
keys=${args[@]/%/\\|} # result: key1\| key2\| key3\|
keys=${keys// } # result: key1\|key2\|key3\|
grep "${keys}" file_name
Edit:
Based on Pavel Shved's suggestion:
( IFS="|"; keys="${args[*]}"; keys="${keys//|/\\|}"; grep "${keys}" file_name )
The first version as a one-liner:
keys=${args[@]/%/\\|}; keys=${keys// }; grep "${keys}" file_name
Edit2:
Even better than the version using IFS:
printf -v keys "%s\\|" "${args[@]}"; grep "${keys}" file_name
I tend to use process substitution for everything. It's convenient when combined with grep's -f option:
Obtain patterns from FILE, one per line.
(Depending on the context, you might even want to combine that with -F, -x or -w, etc., for awesome effects.)
So:
#! /usr/bin/env bash
t=(8 12 24)
seq 30 | grep -f <(printf '%s\n' "${t[@]}")
and I get:
8
12
18
24
28
I basically write a pseudo-file with one item of the array per line, and then tell grep to use each of these lines as a pattern.
The command
( IFS="|" ; grep --perl-regexp "${args[*]}" ) <file_name
searches the file for each keyword in the array. It does so by constructing the regular expression word1|word2|word3, which matches any of the given alternatives (in perl mode).
If there is a way to join array elements into a string, delimiting them with a sequence of characters (namely, \|), it could be done without the perl regexp.
Perhaps something like this:
cat file_name | while read line
do
    for arg in "${args[@]}"
    do
        echo "$line" | grep -q -w "$arg"
    done
done
not tested!
