Partial string search between two files using AWK - bash

I have been trying to re-write an egrep command using awk to improve performance but haven't been successful. The egrep command performs a simple case insensitive search of the records in file1 against (partial matches in) file2. Below is the command and sample output.
file1 contains:
Abc
xyz
123
blah
hh
a,b
file2 contains:
abc de
xyz
123
456
blah
test1
abdc
abc,def,123
kite
a,b,c
Original command :
egrep -i -f file1 file2
Original (egrep) command output :
$ egrep -i -f file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c
I would like to use AWK to rewrite the command to do the same operation. I have tried the below, but it performs a full-record match and not a partial one like grep does.
Modified command in awk :
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
Modified command (awk) output:
$ awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
xyz
123
blah
This excludes the records which had partial matches for the string "abc". Any help to fix the awk command please? Thanks in advance.

Use index() like this for a partial (substring) literal match:
awk '
NR == FNR {                  # first file: store lowercased search strings
    needles[tolower($0)]
    next
}
{                            # second file: test each line against every needle
    haystack = tolower($0)
    for (needle in needles) {
        if (index(haystack, needle)) {
            print
            break
        }
    }
}' file1 file2
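On the sample files this should print the same lines as the egrep command (expected output, not re-run here):
abc de
xyz
123
blah
abc,def,123
a,b,c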

I would be a bit surprised if it's significantly faster than egrep, but you can try this:
$ awk 'NR==FNR {r=r ((r=="")?"":"|") tolower($0);next} tolower($0)~r' file1 file2
abc de
xyz
123
blah
abc,def,123
Explanation: first build the r1|r2|...|rn regular expression from the content of file1 and store it in awk variable r. Then print all lines of file2 that match it, thanks to the ~ match operator.
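To see the expression that gets built from the sample file1, something like this should print it (a quick sketch):
$ awk 'NR==FNR{r=r ((r=="")?"":"|") tolower($0);next} FNR==1{print r; exit}' file1 file2
abc|xyz|123|blah|hh|a,b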
If you have GNU awk you can use its IGNORECASE variable instead of tolower:
$ awk -v IGNORECASE=1 'NR==FNR{r=r ((r=="")?"":"|") $0;next} $0~r' file1 file2
abc de
xyz
123
blah
abc,def,123
And with GNU awk it could be that forcing the type of r to regexp instead of string leads to better performance. The manual says:
Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is "regexp
constants," for several reasons:
...
It is more efficient to use regexp constants. 'awk' can note that
you have supplied a regexp and store it internally in a form that
makes pattern matching more efficient. When using a string
constant, 'awk' must first convert the string into this internal
form and then perform the pattern matching.
In order to do this you can try:
$ awk -v IGNORECASE=1 'NR==FNR {s=s ((s=="")?"":"|") $0;next}
FNR==1 && NR!=FNR {r=@//;sub(//,s,r);print typeof(r),r} $0~r' file1 file2
regexp Abc|xyz|123|blah|hh
abc de
xyz
123
blah
abc,def,123
(r=@// forces variable r to be of type regexp and sub(//,s,r) does not change this)
Note: just like with your egrep attempts, the lines of file1 are considered as regular expressions, not simple text strings to search for. So, if one line in file1 is .*, all lines in file2 will match, not just the lines containing substring .*.
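If you want the file1 entries treated as plain text rather than regular expressions, the index() answer above already behaves that way; on the grep side the equivalent should be the -F (fixed strings) option:
$ grep -iFf file1 file2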

Related

Bash, get substring by keeping the match with awk

How can I split a string with awk but printing the match too?
Full random string:
aaa sasawf wewfTotemeswdwqewqwqtotemwewedew
I need to get "wewftotemeswdwqewqwqtotemwewedew", where the substring is random; the only constants are a space before it and the word totem somewhere inside it. As you can see, the random string might contain more than one totem word, and I need awk to get the substring starting from the first match. To be clear, I need "wewftotemeswdwqewqwqtotemwewedew", not "totemwewedew". I also need it to be case insensitive.
I can use awk -F ' .*totem' '{print $2}' to print eswdwqewqwqtotemwewedew, but how can I print the match too?
With GNU awk for the third arg to match():
$ echo 'aaa sasawf wewftotemeswdwqewqwq' |
awk 'match($0,/[^ ]*totem[^ ]*/,a) { print a[0] }'
wewftotemeswdwqewqwq
and with any awk:
$ echo 'aaa sasawf wewftotemeswdwqewqwq' |
awk 'match($0,/[^ ]*totem[^ ]*/) { print substr($0,RSTART,RLENGTH) }'
wewftotemeswdwqewqwq
For case-insensitive matching with GNU awk:
awk -v IGNORECASE=1 'match($0,/[^ ]*totem[^ ]*/...
and with any awk:
awk 'match(tolower($0),/[^ ]*totem[^ ]*/...
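Putting it together on the original sample string, the portable case-insensitive version should give something like this (a sketch, not run against your data):
$ echo 'aaa sasawf wewfTotemeswdwqewqwqtotemwewedew' |
awk 'match(tolower($0),/[^ ]*totem[^ ]*/) { print substr(tolower($0),RSTART,RLENGTH) }'
wewftotemeswdwqewqwqtotemwewedew
Since tolower() does not change string length, you could also print substr($0,RSTART,RLENGTH) to keep the original casing.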

Extracting unique values between 2 files with awk

I need to get unique lines when comparing 2 files. These files contain the field separator ":", which should be treated as the end of line while comparing the strings.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the strings before ":" should be compared.
Is it possible to complete this task in one command?
I tried to complete the task this way, but the field separator does not work in that situation:
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
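For contrast, keying on the whole line ($0) only removes exact duplicates, which is why orange:oval and banana:big would survive (expected output, not run):
$ awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval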
One comment: awk can be inefficient when it has to hold large arrays in memory. In real life, with big files, it may be better to use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2

Non matching word from file1 to file2

I have two files - file1 & file2.
file1 contains (only words), say:
ABC
YUI
GHJ
I8O
..................
file2 contains many paragraphs:
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
...................
I am using the below command to get the lines in file2 which contain a word from file1:
grep -Ff file1 file2
(gives as output the lines where words of file1 are found in file2)
I also need the words from file1 which are not matched/found in file2, but I am unable to get the non-matching words.
Can anyone help in getting the below output?
YUI
I8O
I am looking for a one-liner command (via grep, awk, or sed), as I am using the pssh command and can't use while/for loops.
You can print only the matched parts with -o.
$ grep -oFf file1 file2
ABC
GHJ
Use that output as a list of patterns for a search in file1. Process substitution <(cmd) simulates a file containing the output of cmd. With -v you can print lines that did not match. If file1 contains two lines such that one line is a substring of another line you may want to add -x (only match whole lines) to prevent false positives.
$ grep -vxFf <(grep -oFf file1 file2) file1
YUI
I8O
Using Perl - both matched/non-matched in same one-liner
$ cat sinw.txt
ABC
YUI
GHJ
I8O
$ cat sin_in.txt
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
$ perl -lne '
BEGIN { %x=map{chomp;$_=>1} qx(cat sinw.txt); $w="\\b(?:".join("|",keys %x).")\\b" }
print "$&" and delete($x{$&}) if /$w/ ;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
ABC
GHJ
non-matched
I8O
YUI
$
Getting only the non-matched
$ perl -lne '
BEGIN {
%x = map { chomp; $_=>1 } qx(cat sinw.txt);
$w = "\\b" . join("\|",keys %x) . "\\b"
}
delete($x{$&}) if /$w/;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
non-matched
I8O
YUI
$
Note that even a single use of the $& variable used to be very expensive for the whole program in Perl versions prior to 5.20.
Assuming your "words" in file1 are in more than 1 line :
while read -r line
do
    for word in $line
    do
        if ! grep -q "$word" file2
        then
            echo "$word not found"
        fi
    done
done < file1
For the non-matching words, here's one GNU awk solution:
awk 'NR==FNR{a[$0];next} !($1 in a)' RS='[ \n]' file2 file1
YUI
I8O
Or !($0 in a); it's the same. Since I set RS='[ \n]', every space acts as a record separator too.
And note that I read file2 first, and then file1.
If file2 could be empty, you should change NR==FNR to a different file-checking method, like ARGIND==1 for GNU awk, or FILENAME=="file2", or FILENAME==ARGV[1], etc.
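For example, the FILENAME variant could look like this (a sketch, assuming the file really is named file2):
awk 'FILENAME=="file2"{a[$0];next} !($1 in a)' RS='[ \n]' file2 file1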
Same mechanism for only the matched one too:
awk 'NR==FNR{a[$0];next} $0 in a' RS='[ \n]' file2 file1
ABC
GHJ

ksh shell script to print and delete matched line based on a string

I have 2 files like below. I need a script that finds each string from File2 in File1, deletes the lines of File1 which contain that string, and puts the remaining lines in another file (Output1.txt). It should also print the strings from File2 that don't exist in File1 (Output2.txt).
File1:
Apple
Boy: Goes to school
Cat
File2:
Boy
Dog
I need output like below.
Output1.txt:
Apple
Cat
Output2.txt:
Dog
Can anyone help please
If you have awk available on your system:
awk -v FS='[ :]' 'NR==FNR{a[$1]}NR>FNR&&!($1 in a){print $1}' File2 File1 > Output1.txt
awk -v FS='[ :]' 'NR==FNR{a[$1]}NR>FNR&&!($1 in a){print $1}' File1 File2 > Output2.txt
The script stores in an array a the first field $1 of each line of the first file given as an argument.
If the first field of a line from the second file is not in the array, it is printed.
Note that the field delimiter is either a space or a colon (:).
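If a non-matching line in File1 can have text after the first field that you want to keep in Output1.txt, printing the whole record instead of just $1 should work (a small variant, untested):
awk -v FS='[ :]' 'NR==FNR{a[$1]; next} !($1 in a){print $0}' File2 File1 > Output1.txt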

(shell) How to remove strings from one file which can be found in another file?

file1.txt
aaaa
bbbb
cccc
dddd
eeee
file2.txt
DDDD
cccc
aaaa
result
bbbb
eeee
If it could be case insensitive, it would be even better!
Thank you!
grep can match patterns read from a file and print out all lines NOT matching those patterns. It can match case-insensitively too, like:
grep -vi -f file2.txt file1.txt
Excerpts from the man pages:
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file contains zero
    patterns, and therefore matches nothing. (-f is specified by POSIX.)
-i, --ignore-case
    Ignore case distinctions in both the PATTERN and the input files.
    (-i is specified by POSIX.)
-v, --invert-match
    Invert the sense of matching, to select non-matching lines.
    (-v is specified by POSIX.)
Off the top of my head, use grep -Fiv -f file2.txt < file1.txt.
-F no regexps (fast)
-i case-insensitive
-v invert results
-f <pattern file> get patterns from file
$ grep -iv -f file2 file1
bbbb
eeee
or you can use awk
awk 'FNR==NR{ a[tolower($1)]=$1; next }
{
    s = tolower($1)
    f = 0
    for (i in a) { if (i == s) { f = 1 } }
    if (!f) { print s }
}' file2 file1
ghostdog74's awk example can be simplified:
awk '
FNR == NR { omit[tolower($0)]++; next }
tolower($0) in omit {next}
{print}
' file2 file1
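For the sample files this should again print (expected output, not re-run here):
bbbb
eeee
Unlike the previous version, it prints the original line rather than the lowercased first field, which matters if the surviving lines contain uppercase letters or more than one field.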
For various set operations on files see:
http://www.pixelbeat.org/cmdline.html#sets
In your case the inputs are not sorted, so
you want the difference like:
sort -f file1 file2 file2 | uniq -iu
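On the question's two files this should give (not re-run here):
$ sort -f file1 file2 file2 | uniq -iu
bbbb
eeee
Note this assumes file1 has no duplicated lines of its own; a line repeated within file1 would be dropped by uniq -u even if it never appears in file2.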
