Using grep -f to find the patterns themselves that have matches - bash

I'm trying to give grep a pattern file (through -f), but I want to learn which patterns match something in the search file.
For example, given 1.txt:
a/(.*)
b/(.*)
b/c/(.*)
b/foo/(.*)
d/(.*)
e/(.*)
and 2.txt:
a/
a/foo/bar/
b/foo/
d/foo/
The patterns from 1.txt that match something in 2.txt (omitting the (.*) suffix) are as follows:
a/
b/
b/foo/
d/
How can I "find the list of patterns that have a match"?
EDIT: I'm only looking for a prefix match but I think the question is interesting enough for general pattern matching.
EDIT: Since a for-loop based solution is given, I should say I'm not looking at calling grep 10000 times. :) The working solution I already have (listed below) is pretty slow:
for line in "${file1_arr[@]}"; do
if grep -qE "^$line(.*)\$" 2.txt; then
echo "$line"
fi
done
Ideally I'm looking for a single grep call or so with less overhead.

In awk:
$ awk 'NR==FNR{a[$0]=FNR;next}{for(i in a)if($0 ~ i)print i,$0}' 1.txt 2.txt
a/(.*) a/
a/(.*) a/foo/bar
b/(.*) b/foo
d/(.*) d/foo
Explained:
$ awk ' # yes
NR==FNR { # process first file
a[$0]=FNR # hash regex, store record number just in case
next # process next record
}
{ # process second file
for(i in a) # loop every entry in 1.txt
if($0 ~ i) # if regex matches record
print i,$0} # print all matching regex and record
' 1.txt 2.txt
Edit: To output each regex just once (as shown in the expected output), you could delete the regex from a once it has been used. That way it won't get matched and output more than once:
$ awk '
NR==FNR { a[$0]; next }
{
for(i in a)
if($0 ~ i) {
print i
delete a[i] # deleted regex won't get matched again
}
}' 1.txt 2.txt
vendor/cloud.google.com/go/compute/metadata/(.*)$
vendor/cloud.google.com/go/compute/(.*)$
vendor/cloud.google.com/go/(.*)$
vendor/cloud.google.com/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/dns/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/(.*)$
vendor/github.com/Azure/(.*)$
vendor/github.com/(.*)$
Also, in my test this modification for GNU awk cut the run time by about 60 % (mini laptop, from 1:16 down to 29 s), using the data you provided in the comments, file1.txt and file2.txt:
$ awk '
BEGIN {
FS="." # . splits the url
}
NR==FNR { a[$1][$0]; next } # we index on the first part of url
{
for(i in a[$1]) # search space decreased
if($0 ~ i) {
print i
delete a[$1][i]
}
}' file1.txt file2.txt
The speedup comes from shrinking the search space: the start of each string, up to the first period, is used as the hash key, i.e.:
FS="." # split at first .
...
a[vendor/github][vendor/github.com/Azure/(.*)$] # example of a hash
...
for(i in a[$1]) # search space decreased
Now it does not have to search the whole hash for a matching regex. It would probably be more practical to use FS="/" ; a[$1 FS $2], but this was just a quick test.
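A rough sketch of that FS="/" idea, written for any POSIX awk rather than GNU awk's arrays of arrays. The function name matching_patterns and the bucket/seen arrays are made up for this illustration. It assumes every pattern starts with a literal first path component (no regex metacharacters before the first /) and is meant as a prefix match:

```shell
# Print each pattern from the pattern file that matches at least one
# line of the search file, testing only patterns in the same bucket.
# $1 = pattern file, $2 = file to search
matching_patterns() {
    awk -F/ '
        NR == FNR {                      # first file: the patterns
            # bucket key is the text before the first "/"
            bucket[$1] = bucket[$1] $0 "\n"
            next
        }
        ($1 in bucket) {                 # second file: lines to search
            n = split(bucket[$1], pats, "\n")
            for (i = 1; i < n; i++)      # last element is empty
                if (!(pats[i] in seen) && $0 ~ ("^" pats[i])) {
                    seen[pats[i]]        # report each pattern once
                    print pats[i]
                }
        }
    ' "$1" "$2"
}
```

Called as matching_patterns 1.txt 2.txt, it should print only the patterns whose bucket actually occurs in 2.txt, so lines are never tested against patterns from unrelated directories.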

The following script:
#!/usr/bin/env bash
lines=$(wc -l < 1.txt)
for (( i=1; i<=$lines; i++ )); do
line=$(sed -n "$i"p 1.txt)
line=$(sed "s/\/(.*)$//" <<< "$line")
grep -E "$line" 2.txt 1>/dev/null && echo "$line"
done
prints lines in 1.txt that matched in 2.txt:
a
b
b/foo
d
comments:
# gets a single line from 1.txt
line=$(sed -n "$i"p 1.txt)
# removes trailing pattern /(.*) from $line variable
line=$(sed "s/\/(.*)$//" <<< "$line")
# if $line matches in 2.txt, print $line
grep -E "$line" 2.txt 1>/dev/null && echo "$line"

I tried the awk and sed based solutions, and I realized I can do this much faster using bash's builtin regexp engine if I read both files in memory.
Here's basically it.
text="$(cat 2.txt)" # read 2.txt
while read -r line; do # for each 'line' from 1.txt
re=[^\b]*${line} # prepend ^ or \b to the pattern
if [[ "$text" =~ $re ]]; then # match the pattern to 2.txt
echo "${line}" # if there's a match, print the pattern
fi
done < <(cat "1.txt")
Since this doesn't spawn any extra processes and just does it in-memory, I suspect this is quite efficient. My benchmarks with the files I linked under James' answer show 8-9 seconds for this.

I don't see a solution with grep, but sed is an alternative to awk.
With sed I would like to see patterns like b/foo/.* in 1.txt, but I will show a solution based on the (.*).
The first command generates sed commands that replace an input line with the regular expression whenever the line matches it. Each generated command should look like
sed -rn 's#b/c/(.*)#b/c/#p' 2.txt
and this can be done with
# Use subprocess
sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt
# resulting in
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt) 2.txt| sort -u
The solution is a bit difficult to read because of the layout of 1.txt, where I would prefer lines like b/foo/.*.
The above commands will have 2 bugs:
When the match is on a part of the line, the non-matched part will be shown in the output. This can be fixed with matching the garbage
# Use lines like 's#.*b/foo(.*)#b/foo#p'
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\1#p/' 1.txt) 2.txt| sort -u
The second bug is that strings in 2.txt that have two matches, will be matched only once (the first match will edit the line in the stream).
This can be fixed by adding some unique marker (I will use \a) for the matching lines and repeating the inputlines on the output (with \n&).
The output can be viewed by looking for the \a markers.
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\\a\1\\n\&#p/' 1.txt) 2.txt|
sed -rn '/\a/ s/.(.*)/\1/p' | sort -u
EDIT:
The work-around with a marker and restoring the original input is not needed when you follow a different approach.
In sed you can print something to stdout without changing the stream.
One possibility (slow for this situation) is using
sed '/something/ eecho "something" '
Another possibility is using the "x" command (which eXchanges the pattern space with the hold buffer). You actually want a sed script with commands like
\%a/% {h;s%.*%a/%p;x}
\%b/% {h;s%.*%b/%p;x}
\%b/c/% {h;s%.*%b/c/%p;x}
\%b/foo/% {h;s%.*%b/foo/%p;x}
\%d/% {h;s%.*%d/%p;x}
\%e/% {h;s%.*%e/%p;x}
Using above method the sed solution simplifies into
sed -nf <(
sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt
) 2.txt | sort -u
When the file 1.txt is not changed often, you might want to preprocess that file.
sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt > /tmp/sed.in
sed -nf /tmp/sed.in 2.txt | sort -u

Related

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space separated substring on a newline.
tr -d '[:punct:]' to remove punctuation
sort and uniq to make a sorted file to use with comm which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in `cat file1 | tr -d '[:punct:]'`; do grep -wq $A file2 || echo $A; done
flags used for grep: q for quiet (don't need output), w for word match
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2
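For comparison, here is a sketch of how the sort/comm idea from the question could be made to work. The likely problems with the original attempt are that comm expects both inputs to be sorted the same way and that, as far as I know, GNU comm has no case-insensitive -i option (the BSD one does). The function name words_not_in_list is invented for this example, and the output comes out lowercased and sorted rather than in order of appearance:

```shell
# Print the words of a text file that do not occur in a word list.
# $1 = text file, $2 = word list (one word per line)
words_not_in_list() {
    list_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted identically, so fix the locale
    # and lowercase both sides before sorting.
    tr '[:upper:]' '[:lower:]' < "$2" | LC_ALL=C sort -u > "$list_sorted"
    tr -cs '[:alpha:]' '\n' < "$1" |        # one word per line, punctuation dropped
        tr '[:upper:]' '[:lower:]' |
        grep -v '^$' |                      # drop a possible leading empty line
        LC_ALL=C sort -u |
        LC_ALL=C comm -23 - "$list_sorted"  # keep lines unique to the first input
    rm -f "$list_sorted"
}
```

Usage would be words_not_in_list file_1.txt file_2.txt.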

Search file A for a list of strings located in file B and append the value associated with that string to the end of the line in file A

This is a bit complicated, well I think it is..
I have two files, File A and file B
File A contains delay information for a pin and is in the following format
AD22 15484
AB22 9485
AD23 10945
File B contains a component declaration that needs this information added to it and is in the format:
'DXN_0':
PIN_NUMBER='(AD22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
'DXP_0':
PIN_NUMBER='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AD23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
'VREFN_0':
PIN_NUMBER='(AB22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
So what I am trying to achieve is the following output
'DXN_0':
PIN_NUMBER='(AD22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='15484';
'DXP_0':
PIN_NUMBER='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AD23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='10945';
'VREFN_0':
PIN_NUMBER='(AB22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='9485';
There is no order to the pin numbers in file A or B
So I'm assuming the following needs to happen
open file A, read first line
search file B for first string field in the line just read
once found in file B at the end of the line add the text "\nPIN_DELAY='"
add the second string field of the line read from file A
add the following text at the end "';"
repeat by opening file A, read the second line
I'm assuming it will be a combination of sed and awk commands and I'm currently trying to work it out but think this is beyond my knowledge. Many thanks in advance as I know it's complicated..
FILE2=`cat file2`
FILE1=`cat file1`
TMPFILE=`mktemp XXXXXXXX.tmp`
FLAG=0
for line in $FILE1;do
echo $line >> $TMPFILE
for line2 in $FILE2;do
if [ $FLAG == 1 ];then
echo -e "PIN_DELAY='$(echo $line2 | awk -F " " '{print $1}')'" >> $TMPFILE
FLAG=0
elif [ "`echo $line | grep $(echo $line2 | awk -F " " '{print $1}')`" != "" ];then
FLAG=1
fi
done
done
mv $TMPFILE file1
Works for me. You can also add a trap to remove the temp file if the user sends SIGINT.
awk to the rescue...
$ awk -vq="'" 'NR==FNR{a[$1]=$2;next} {print; for(k in a) if(match($0,k)) {print "PIN_DELAY=" q a[k] q ";"; next}}' keys data
'DXN_0':
PIN_NUMBER='(AD22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='15484';
'DXP_0':
PIN_NUMBER='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AD23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='10945';
'VREFN_0':
PIN_NUMBER='(AB22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='9485';
Explanation: scan the first file for key/value pairs. For each line in the second data file, print the line; for any matching key, print the value of the key in the requested format. Single quotes in awk are a little tricky; setting a q variable is one way of handling them.
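One caveat with match($0,k) is that every key is tried as a regex against every line, so a key like AD2 would also hit AD22. A possible variant, sketched here under the assumption that pin names consist only of letters, digits and underscores, splits each line into tokens and looks them up exactly. The function name add_pin_delays and the tok array are made up for this example:

```shell
# Append a PIN_DELAY line after each line that mentions a known pin.
# $1 = delay table ("PIN DELAY" per line), $2 = component file
add_pin_delays() {
    awk -v q="'" '
        NR == FNR { delay[$1] = $2; next }   # first file: pin -> delay
        {
            print
            # split the line on anything that cannot be part of a pin name
            n = split($0, tok, /[^A-Za-z0-9_]+/)
            for (i = 1; i <= n; i++)
                if (tok[i] in delay) {       # exact lookup, no regex scan
                    print "PIN_DELAY=" q delay[tok[i]] q ";"
                    break
                }
        }
    ' "$1" "$2"
}
```

With a large delay table this also avoids looping over all keys for every line of the component file.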
FINAL Script for my application, A big thank you to all that helped..
#!/usr/bin/sh
# script created by Adam with a LOT of help from users on stackoverflow
# must pass $1 file (package file from Xilinx)
# must pass $2 file (chips.prt file from the PCB design office)
# remove these temp files, throws error if not present tho, whoops!!
rm DELAYS.txt CHIP.txt OUTPUT.txt
# BELOW::create temp files for the code thanks to Glastis@stackoverflow https://stackoverflow.com/users/5101968/glastis I now know how to do this
DELAYS=`mktemp DELAYS.txt`
CHIP=`mktemp CHIP.txt`
OUTPUT=`mktemp OUTPUT.txt`
# BELOW::grep input file 1 (pkg file from Xilinx) for lines containing a delay in the form of n.n and use TAIL to remove something (can't remember), sed to remove blanks and replace with single space, sed to remove space before \n, use awk to print columns 3,9,10 and feed into awk again to calculate delay provided by fedorqui@stackoverflow https://stackoverflow.com/users/1983854/fedorqui
# In awk, NF refers to the number of fields on the current line. Since $n refers to the field number n, with $(NF-1) we refer to the penultimate field.
# {...}1 do stuff and then print the resulting line. 1 evaluates as True and anything True triggers awk to perform its default action, which is to print the current line.
# ($(NF-1) + $NF)/2 * 141 performs the calculation (penultimate + last) / 2 * 141 (note: the command below uses 169 instead of 141)
# {$(NF-1)=sprintf( ... ) assign the result of the previous calculation to the penultimate field. Using sprintf with %.0f we make sure the rounding is performed, as described above.
# {...; NF--} once the calculation is done, we have its result in the penultimate field. To remove the last column, we just say "hey, decrease the number of fields" so that the last one gets "removed".
grep -E -0 '[0-9]\.[0-9]' $1 | tail -n +2 | sed -e 's/[[:blank:]]\+/ /g' -e 's/\s\n/\n/g' | awk '{print ","$3",",$9,$10}' | awk '{$(NF-1)=sprintf("%.0f", ($(NF-1) + $NF)/2 * 169); NF--}1' >> $DELAYS
# remove blanks in part file and add additional commas (,) so that the following awk command works properly
cat $2 | sed -e "s/[[:blank:]]\+//" -e "s/(/(,/g" -e 's/)/,)/g' >> $CHIP
# this awk command is provided by karakfa@stackoverflow https://stackoverflow.com/users/1435869/karakfa Explanation: scan the first file for key/value pairs. For each line in the second data file print the line, for any matching key print value of the key in the requested format. Single quotes in awk is little tricky, setting a q variable is one way of handling it. https://stackoverflow.com/questions/32458680/search-file-a-for-a-list-of-strings-located-in-file-b-and-append-the-value-assoc
awk -vq="'" 'NR==FNR{a[$1]=$2;next} {print; for(k in a) if(match($0,k)) {print "PIN_DELAY=" q a[k] q ";"; next}}' $DELAYS $CHIP >> $OUTPUT
# remove the additional commas (,) added in earlier before ) and after ( and you are done..
cat $OUTPUT | sed -e 's/(,/(/g' -e 's/,)/)/g' >> chipsd.prt

How can I find unique characters per line of input?

Is there any way to extract the unique characters of each line?
I know I can find the unique lines of a file using
sort -u file
I would like to determine the unique characters of each line (something like sort -u for each line).
To clarify: given this input:
111223234213
111111111111
123123123213
121212122212
I would like to get this output:
1234
1
123
12
Using sed
sed ':;s/\(.\)\(.*\)\1/\1\2/;t' file
Basically what it does is capture a character and check if it appears anywhere else on the line. It also captures all the characters between these.
Then it replaces all of that, including the second occurrence, with just the first occurrence and whatever was in between.
t is test and jumps to the : label if the previous command was successful. Then this repeats until the s/// command fails meaning only unique characters remain.
; just separates commands.
1234
1
123
12
Keeps order as well.
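The unnamed label (a bare :) is, as far as I know, a GNU sed convenience. A sketch of the same loop with a named label and separate -e expressions, which should also work with other sed implementations, could look like this (the function name uniq_chars_sed and the label again are arbitrary):

```shell
# Remove repeated characters from each line, keeping first occurrences.
uniq_chars_sed() {
    sed -e ':again' \
        -e 's/\(.\)\(.*\)\1/\1\2/' \
        -e 't again' "$@"
}
```

Each pass of the s command drops one later copy of a repeated character, and t again jumps back as long as a substitution was made.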
It doesn't get things in the original order, but this awk one-liner seems to work:
awk '{for(i=1;i<=length($0);i++){a[substr($0,i,1)]=1} for(i in a){printf("%s",i)} print "";delete a}' input.txt
Split apart for easier reading, it could be stand-alone like this:
#!/usr/bin/awk -f
{
# Step through the line, assigning each character as a key.
# Repeated keys overwrite each other.
for(i=1;i<=length($0);i++) {
a[substr($0,i,1)]=1;
}
# Print items in the array.
for(i in a) {
printf("%s",i);
}
# Print a newline after we've gone through our items.
print "";
# Get ready for the next line.
delete a;
}
Of course, the same concept can be implemented pretty easily in pure bash as well:
#!/usr/bin/env bash
while read s; do
declare -A a
while [ -n "$s" ]; do
a[${s:0:1}]=1
s=${s:1}
done
printf "%s" "${!a[@]}"
echo ""
unset a
done < input.txt
Note that this depends on bash 4, due to the associative array. In my test this one did print things in the original order, but bash does not guarantee any particular order for associative array keys, so don't rely on that.
And I think you've got a solution using sed from Jose, though it has a bunch of extra pipe-fitting involved. :)
The last tool you mentioned was grep. I'm pretty sure you can't do this in traditional grep, but perhaps some brave soul might be able to construct a perl-regexp variant (i.e. grep -P) using -o and lookarounds. They'd need more coffee than is in me right now though.
One way using perl:
perl -F -lane 'print do { my %seen; grep { !$seen{$_}++ } @F }' file
Results:
1234
1
123
12
Another solution,
while read line; do
grep -o . <<< "$line" | sort -u | paste -s -d '\0' -;
done < file
grep -o . convert 'row line' to 'column line'
sort -u sort letters and remove repeated letters
paste -s -d '\0' - convert 'column line' to 'row line'
- as a filename argument to paste to tell it to use standard input.
This awk should work:
awk -F '' '{delete a; for(i=1; i<=NF; i++) a[$i]; for (j in a) printf "%s", j; print ""}' file
1234
1
123
12
Here:
-F '' will break the record char by char giving us single character in $1, $2 etc.
Note: For non-gnu awk use:
awk 'BEGIN{FS=""} {delete a; for(i=1; i<=NF; i++) a[$i];
for (j in a) printf "%s", j; print ""}' file
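The for (j in a) loop visits keys in an unspecified order, so the characters may not come out in order of appearance. If that matters, a small variation that builds the output string while scanning the line might do. This is only a sketch; the function name uniq_chars_awk and the variables seen and out are not from the answers above:

```shell
# Print the distinct characters of each line in order of first appearance.
uniq_chars_awk() {
    awk '{
        out = ""
        split("", seen)                  # portable way to empty an array
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (!(c in seen)) {          # first time we meet this character
                seen[c]
                out = out c
            }
        }
        print out
    }' "$@"
}
```

It uses substr instead of an empty field separator, so it does not depend on how a given awk treats FS="".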
This might work for you (GNU sed):
sed 's/\B/\n/g;s/.*/echo "&"|sort -u/e;s/\n//g' file
Split each line into a series of lines. Unique sort those lines. Combine the result back into a single line.
Unique and sorted alternative to the others, using sed and gnu tools:
sed 's/\(.\)/\1\n/g' file | sort | uniq
which produces one character per line; if you want those on one line, just do:
sed 's/\(.\)/\1\n/g' file | sort | uniq | sed ':a;N;$!ba;s/\n//g;'
This has the advantage of showing the characters in sorted order, rather than order of appearance.

How to quickly delete the lines in a file that contain items from a list in another file in BASH?

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The other two lines are removed because "cat" appears on them between { and }.
The following script successfully does this task:
while read -r line
do
sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading a file, but I cannot find any manuals explaining how to use this.
How can I improve the speed of the script?
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It converts file.txt directly to have the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
b[j++] = $0
}
END {
printf "" > FILENAME
for (i = 0; i in b; ++i)
print b[i] > FILENAME
}
' words.txt file.txt
If the files may grow too large for awk to hold in memory, we cannot rewrite the file in place; print to stdout instead:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
}
1
' words.txt file.txt
you can use grep to match 2 files like this:
grep -vf words.txt file.txt
I think that using the grep command should be way faster. For example:
grep -f words.txt -v file.txt
The f option makes grep use the words.txt file as matching patterns.
The v option reverses the matching, i.e. keeps the lines that do not match any of the patterns.
It doesn't solve the {} constraint, but that is easily avoidable, for example by adding the brackets to the pattern file (or in a temporary file created at runtime).
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This basically just modifies the words.txt file on the fly and uses it as a word file for grep.
In pure native bash (4.x):
#!/usr/bin/env bash
# ^-- MUST start with a bash shebang, NOT /bin/sh
readarray -t words <words.txt # read words into array
IFS='|' # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]" # form a regex matching all words
while read -r; do # for each line in file...
if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
printf '%s\n' "$REPLY" # ...and print it if not.
fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to use stdin
update
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
p=1
for (i in a) {
if ($0 ~ "{.*" i ".*}") { p=0; break }
}
}p' words.txt file.txt
The first block builds an array containing each line in words.txt. The second block runs for every line in file.txt. A flag p controls whether the line is printed. If the line matches the pattern, p is set to false. When the p outside the last block evaluates to true, the default action occurs, which is to print the line.
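The inner loop over all words can be avoided by joining the words into a single alternation, the same idea as the bash version uses. A sketch, assuming words.txt is non-empty and its entries are safe to use as regex fragments (filter_braced, alt, re and built are names chosen for this example):

```shell
# Print the lines of the text file that do NOT contain any listed word
# inside a {...} group.
# $1 = word list, $2 = text file
filter_braced() {
    awk '
        # first file: join all words into one alternation, e.g. cat|mice
        NR == FNR { alt = (alt == "" ? $0 : alt "|" $0); next }
        # first line of the second file: build the regex once
        !built    { re = "[{][^{}]*(" alt ")[^{}]*[}]"; built = 1 }
        # print lines that do not match
        $0 !~ re
    ' "$1" "$2"
}
```

To update the file itself, redirect the output to a temporary file and move it over file.txt afterwards.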

Count how many times each word from a word list appears in a file?

I have a file, list.txt, which contains a list of words. I want to check how many times each word appears in another file, file1.txt, then output the results. A simple output of all of the numbers is sufficient, as I can manually add them to list.txt with a spreadsheet program, but if the script adds the numbers at the end of each line in list.txt, that is even better, e.g.:
bear 3
fish 15
I have tried this, but it does not work:
cat list.txt | grep -c file1.txt
You can do this in a loop that reads a single word at a time from a word-list file, and then counts the instances in a data file. For example:
while read; do
echo -n "$REPLY "
grep -Fow "$REPLY" data.txt | wc -l
done < <(sort -u word_list.txt)
The "secret sauce" consists of:
using the implicit REPLY variable;
using process substitution to collect words from the word-list file; and
ensuring that you are grepping for whole words in the data file.
This awk method only has to pass through each file once:
awk '
# read the words in list.txt
NR == FNR {count[$1]=0; next}
# process file1.txt
{
for (i=1; i<=NF; i++) # fields start at 1; $0 is the whole line
if ($i in count)
count[$i]++
}
# output the results
END {
for (word in count)
print word, count[word]
}
' list.txt file1.txt
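Since for (word in count) returns the words in an unspecified order, the counts may not line up with list.txt. If you want them in list order (handy for pasting next to the list), something like the following sketch could work. The function name count_words and the order array are assumptions of this example, and words are taken to be whitespace-separated:

```shell
# Print "word count" for every word of the list, in list order.
# $1 = word list (one word per line), $2 = file to count in
count_words() {
    awk '
        # remember each listed word and the order it appeared in
        NR == FNR { order[++n] = $1; count[$1] = 0; next }
        # count occurrences of listed words among the fields
        { for (i = 1; i <= NF; i++) if ($i in count) count[$i]++ }
        # report in the original list order
        END { for (k = 1; k <= n; k++) print order[k], count[order[k]] }
    ' "$1" "$2"
}
```

Usage would be count_words list.txt file1.txt.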
This might work for you (GNU sed):
tr -s ' ' '\n' < file1.txt |
sort |
uniq -c |
sed -e '1i\s|.*|& 0|' -e 's/\s*\(\S*\)\s\(\S*\)\s*/s|\\<\2\\>.*|\2 \1|/' |
sed -f - list.txt
Explanation:
Split file1.txt into words
Sort the words
Count the words
Create a sed script to match the words (initially zero out each word)
Run the above script against the list.txt
single line command
cat file1.txt |tr " " "\n"|sort|uniq -c |sort -n -r -k 1 |grep -w -f list.txt
The last part of the command tells grep to read the words to match from list.txt (-f option) and to match whole words only (-w), i.e. if list.txt contains car, grep should ignore carriage.
However, keep in mind that your view of a whole word and grep's view might differ. E.g. although car will not match carriage, it will match car-wash, because "-" counts as a word boundary. grep treats anything except letters, numbers and underscores as a word boundary. This should not be a problem, as it conforms to the accepted definition of a word in the English language.
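A quick way to see the word-boundary behaviour for yourself (the three sample words are just for illustration):

```shell
# grep -w accepts "car" and "car-wash" but rejects "carriage"
matches=$(printf '%s\n' car carriage car-wash | grep -w car)
printf '%s\n' "$matches"
```

This should list car and car-wash but not carriage.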
