(shell) How to remove strings from one file which can be found in another file? - shell

file1.txt
aaaa
bbbb
cccc
dddd
eeee
file2.txt
DDDD
cccc
aaaa
result
bbbb
eeee
If it could be case insensitive it would be even more great!
Thank you!

grep can match patterns read from a file, and print out all lines NOT matching that pattern. Can match case insensitively too, like
grep -vi -f file2.txt file1.txt
Excerpts from the man pages:
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)ns zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files. (-i is
specified by POSIX.)ions in both the PATTERN and the input files. (-i is
specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is
specified by POSIX.)of matching, to select non-matching lines. (-v is
specified by POSIX.)

From the top of my head, use grep -Fiv -f file2.txt < file1.txt.
-F no regexps (fast)
-i case-insensitive
-v invert results
-f <pattern file> get patterns from file

$ grep -iv -f file2 file1
bbbb
eeee
or you can use awk
awk 'FNR==NR{ a[tolower($1)]=$1; next }
{
s=tolower($1)
f=0
for(i in a){if(i==s){f=1}}
if(!f) { print s }
} ' file2 file1

ghostdog74's awk example can be simplified:
awk '
FNR == NR { omit[tolower($0)]++; next }
tolower($0) in omit {next}
{print}
' file2 file1

For various set operations on files see:
http://www.pixelbeat.org/cmdline.html#sets
In your case the inputs are not sorted, so
you want the difference like:
sort -f file1 file1 file2 | uniq -iu

Related

Partial string search between two files using AWK

I have been trying to re-write an egrep command using awk to improve performance but haven't been successful. The egrep command performs a simple case insensitive search of the records in file1 against (partial matches in) file2. Below is the command and sample output.
file1 contains:
Abc
xyz
123
blah
hh
a,b
file2 contains:
abc de
xyz
123
456
blah
test1
abdc
abc,def,123
kite
a,b,c
Original command :
egrep -i -f file1 file2
Original (egrep) command output :
$ egrep -i -f file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c
I would like to use AWK to rewrite the command to do the same operation. I have tried the below but it is performing a full record match and not partial like grep does.
Modified command in awk :
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
Modified command (awk) output:
$ awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
xyz
123
blah
This excludes the records which had partial matches for the string "abc". Any help to fix the awk command please? Thanks in advance.
Use index like this for a partial literal match:
awk '
NR == FNR {
needles[tolower($0)]
next
}
{
haystack = tolower($0)
for (needle in needles) {
if (index(haystack, needle)) {
print
break
}
}
}' file1 file2
I would be a bit surprised that it's significantly faster than egrep but you can try this:
$ awk 'NR==FNR {r=r ((r=="")?"":"|") tolower($0);next} tolower($0)~r' file1 file2
abc de
xyz
123
blah
abc,def,123
Explanation: first build the r1|r2|...|rn regular expression from the content of file1 and store it in awk variable r. Then print all lines of file2 that match it, thanks to the ~ match operator.
If you have GNU awk you can use its IGNORECASE variable instead of tolower:
$ awk -v IGNORECASE=1 'NR==FNR{r=r ((r=="")?"":"|") $0;next} $0~r' file1 file2
abc de
xyz
123
blah
abc,def,123
And with GNU awk it could be that forcing the type of r to regexp instead of string leads to better performance. The manual says:
Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is "regexp
constants," for several reasons:
...
It is more efficient to use regexp constants. 'awk' can note that
you have supplied a regexp and store it internally in a form that
makes pattern matching more efficient. When using a string
constant, 'awk' must first convert the string into this internal
form and then perform the pattern matching.
In order to do this you can try:
$ awk -v IGNORECASE=1 'NR==FNR {s=s ((s=="")?"":"|") $0;next}
FNR==1 && NR!=FNR {r=#//;sub(//,s,r);print typeof(r),r} $0~r' file1 file2
regexp Abc|xyz|123|blah|hh
abc de
xyz
123
blah
abc,def,123
(r=#// forces variable r to be of type regexp and sub(//,s,r) does not change this)
Note: just like with your egrep attempts, the lines of file1 are considered as regular expressions, not simple text strings to search for. So, if one line in file1 is .*, all lines in file2 will match, not just the lines containing substring .*.

grep inverse read match pattern from two files

I have a file (note that some lines have more than 2 columns, also some lines are 1 space delimited, and some are multiple space delimited, this file is quite large...)
file1.txt:
there is a line here that has more than two columns
## this line is a comment
blahblah: blahblahSierraexample7272
foo: foo#foobar.com
nonsense: nonsense59s59S
nonsense: someRandomColumn
.....
I have another file that is a subset of file1.txt, this file has two columns only and columns are "1" space delimited!
file2.txt
foo: foo#foo.com
nonsense: nonsense59s59S
now, I would like to delete all lines that appear in file2.txt from file1.txt, how can I do that in a shell script? note that the second file (file2.txt) has two columns only, while file1.txt has multiple... so if a matching needs to be done it should be like: $1(from file2) match $1(from file1) and $NF(from file2) match $NF(from file1) and then inverse the match and print...
P.S. already tried grep -vf file2.txt file1.txt but since the space between column1 and $NF is not fixed it didn't work...
sed and awk should do the trick but can't come up with the code...
sed -i '/^<firstColumnOfFile2> .* <lastColumnOfFile2>$/d' file1.txt (perhaps in a while loop!)
or something like: grep -vw -f ^[(1stColofFile2)] and also [(lastColOfFile2)]$ file1.txt
You can use sed to turn the lines in file2.txt into regular expressions that match one or more spaces after the colon, and then use grep to remove the lines from file1.txt that match those:
$ grep -Evf <(sed 's/^\([^:]*\): /^\1:[[:space:]]+/' file2.txt) file1.txt
there is a line here that has more than two columns
## this line is a comment
blahblah: blahblahSierraexample7272
foo: foo#foobar.com
nonsense: someRandomColumn
$ awk 'NR==FNR{a[$0]; next} {orig=$0; $1=$1} !($0 in a){print orig}' file2 file1
there is a line here that has more than two columns
## this line is a comment
blahblah: blahblahSierraexample7272
foo: foo#foobar.com
nonsense: someRandomColumn
.....

Non matching word from file1 to file2

I have two files - file1 & file2.
file1 contains (only words) says-
ABC
YUI
GHJ
I8O
..................
file2 contains many para.
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
...................
I am using below command to get the matching lines which contains word from file1 in file2
grep -Ff file1 file2
(Gives output of lines where words of file1 found in file2)
I also need the words which doesn't match/found in file 2 and unable to find Un-matching word.
Can anyone help in getting below output
YUI
I8O
i am looking one liner command (via grep,awk,sed), as i am using pssh command and can't use while,for loop
You can print only the matched parts with -o.
$ grep -oFf file1 file2
ABC
GHJ
Use that output as a list of patterns for a search in file1. Process substitution <(cmd) simulates a file containing the output of cmd. With -v you can print lines that did not match. If file1 contains two lines such that one line is a substring of another line you may want to add -x (only match whole lines) to prevent false positives.
$ grep -vxFf <(grep -oFf file1 file2) file1
YUI
I8O
Using Perl - both matched/non-matched in same one-liner
$ cat sinw.txt
ABC
YUI
GHJ
I8O
$ cat sin_in.txt
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
$ perl -lne '
BEGIN { %x=map{chomp;$_=>1} qx(cat sinw.txt); $w="\\b".join("\|",keys %x)."\\b"}
print "$&" and delete($x{$&}) if /$w/ ;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
ABC
GHJ
non-matched
I8O
YUI
$
Getting only the non-matched
$ perl -lne '
BEGIN {
%x = map { chomp; $_=>1 } qx(cat sinw.txt);
$w = "\\b" . join("\|",keys %x) . "\\b"
}
delete($x{$&}) if /$w/;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
non-matched
I8O
YUI
$
Note that even a single use of $& variable used to be very expensive for the whole program, in Perl versions prior to 5.20.
Assuming your "words" in file1 are in more than 1 line :
while read line
do
for word in $line
do
if ! grep -q $word file2
then echo $word not found
fi
done
done < file1
For Un-matching words, here's one GNU awk solution:
awk 'NR==FNR{a[$0];next} !($1 in a)' RS='[ \n]' file2 file1
YUI
I8O
Or !($0 in a), it's the same. Since I set RS='[ \n]', every space as line separator too.
And note that I read file2 first, and then file1.
If file2 could be empty, you should change NR==FNR to different file checking methods, like ARGIND==1 for GNU awk, or FILENAME=="file2", or FILENAME==ARGV[1] etc.
Same mechanism for only the matched one too:
awk 'NR==FNR{a[$0];next} $0 in a' RS='[ \n]' file2 file1
ABC
GHJ

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space separated substring on a newline.
tr -d '[:punct:] to remove punctuations
sort and uniq to make a sorted file to use with comm which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in `cat file1 | tr -d '[:punct:]'`; do grep -wq $A file2 || echo $A; done
flags used for grep: q for quiet (don't need output), w for word match
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2

find difference between two text files with one item per line [duplicate]

This question already has answers here:
How to remove the lines which appear on file B from another file A?
(12 answers)
Closed 6 years ago.
I have two files:
file 1
dsf
sdfsd
dsfsdf
file 2
ljljlj
lkklk
dsf
sdfsd
dsfsdf
I want to display what is in file 2 but not in file 1, so file 3 should look like
ljljlj
lkklk
grep -Fxvf file1 file2
What the flags mean:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
-x, --line-regexp
Select only those matches that exactly match the whole line.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
You can try
grep -f file1 file2
or
grep -v -F -x -f file1 file2
You can use the comm command to compare two sorted files
comm -13 <(sort file1) <(sort file2)
I successfully used
diff "${file1}" "${file2}" | grep "<" | sed 's/^<//g' > "${diff_file}"
Outputting the difference to a file.
if you are expecting them in a certain order, you can just use diff
diff file1 file2 | grep ">"
join -v 2 <(sort file1) <(sort file2)
A tried a slight variation on Luca's answer and it worked for me.
diff file1 file2 | grep ">" | sed 's/^> //g' > diff_file
Note that the searched pattern in sed is a > followed by a space.
file1
m1
m2
m3
file2
m2
m4
m5
>awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
m4
m5
>awk 'NR == FNR {file1[$0]++; next} ($0 in file1)' file1 file2
m2
> What's awk command to get 'm1 and m3' ?? as in file1 and not in file2?
m1
m3
If you want to use loops You can try like this: (diff and cmp are much more efficient. )
while read line
do
flag = 0
while read line2
do
if ( "$line" = "$line2" )
then
flag = 1
fi
done < file1
if ( flag -eq 0 )
then
echo $line > file3
fi
done < file2
Note: The program is only to provide a basic insight into what can be done if u dont want to use system calls such as diff n comm..
an awk answer:
awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
With GNU sed:
sed 's#[^^]#[&]#g;s#\^#\\^#g;s#^#/^#;s#$#$/d#' file1 | sed -f- file2
How it works:
The first sed produces an output like this:
/^[d][s][f]$/d
/^[s][d][f][s][d]$/d
/^[d][s][f][s][d][f]$/d
Then it is used as a sed script by the second sed.

Resources