grep inverse read match pattern from two files - bash

I have a file. Note that some lines have more than two columns, some lines are delimited by a single space and others by multiple spaces, and the file is quite large.
file1.txt:
there is a line here that has more than two columns
## this line is a comment
blahblah: blahblahSierraexample7272
foo: foo#foobar.com
nonsense: nonsense59s59S
nonsense: someRandomColumn
.....
I have another file that is a subset of file1.txt. This file has only two columns, and the columns are delimited by exactly one space.
file2.txt
foo: foo#foo.com
nonsense: nonsense59s59S
Now I would like to delete from file1.txt all lines that appear in file2.txt. How can I do that in a shell script? Note that file2.txt has only two columns, while file1.txt can have more, so the match should be: $1 of file2 matches $1 of file1 and $NF of file2 matches $NF of file1; then invert the match and print.
P.S. I already tried grep -vf file2.txt file1.txt, but since the spacing between column 1 and $NF is not fixed, it didn't work.
sed or awk should do the trick, but I can't come up with the code. Perhaps something like:
sed -i '/^<firstColumnOfFile2> .* <lastColumnOfFile2>$/d' file1.txt (perhaps in a while loop!)
or something like: grep -vw -f ^[(1stColofFile2)] and also [(lastColOfFile2)]$ file1.txt

You can use sed to turn the lines in file2.txt into regular expressions that match one or more spaces after the colon, and then use grep to remove the lines from file1.txt that match those:
$ grep -Evf <(sed 's/^\([^:]*\): /^\1:[[:space:]]+/' file2.txt) file1.txt
there is a line here that has more than two columns
## this line is a comment
blahblah: blahblahSierraexample7272
foo: foo#foobar.com
nonsense: someRandomColumn

$ awk 'NR==FNR{a[$0]; next} {orig=$0; $1=$1} !($0 in a){print orig}' file2 file1
there is a line here that has more than two columns
## this line is a comment
blahblah: blahblahSierraexample7272
foo: foo#foobar.com
nonsense: someRandomColumn
.....
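The matching rule described in the question (first field and last field must agree, whatever the spacing in between) can also be sketched directly in awk. This is a minimal sketch and not taken from either answer above: the scratch directory and the array name skip are illustrative assumptions, and the sample data is the question's, with extra spaces added to one line to mimic the irregular spacing.

```shell
# Sample data in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' \
  'there is a line here that has more than two columns' \
  '## this line is a comment' \
  'blahblah: blahblahSierraexample7272' \
  'foo: foo#foobar.com' \
  'nonsense:    nonsense59s59S' \
  'nonsense: someRandomColumn' > "$tmp/file1.txt"
printf '%s\n' \
  'foo: foo#foo.com' \
  'nonsense: nonsense59s59S' > "$tmp/file2.txt"

# First pass (file2): remember each (first field, last field) pair.
# Second pass (file1): print only lines whose pair was not remembered.
result=$(awk '
  NR == FNR { skip[$1, $NF]; next }
  !(($1, $NF) in skip)
' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because awk splits on runs of whitespace by default, the amount of space between the columns does not affect the comparison. Lines are printed unchanged, so the original spacing in file1.txt is preserved.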

Related

How to merge in one file, two files in bash line by line [duplicate]

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is of greater than or equal length to file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
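The length assumption noted above can be relaxed. The following is a minimal sketch, assuming you also want any leftover lines of file2 printed when file2 is the longer file; the variable names other and line and the scratch directory are illustrative. It compares getline's return value against 0 explicitly, since getline returns -1 on a read error (for example a missing file), and -1 counts as true in a bare if test.

```shell
# Sample data: file2 is deliberately longer than file1.
tmp=$(mktemp -d)
printf '%s\n' line1.1 line1.2 > "$tmp/file1"
printf '%s\n' line2.1 line2.2 line2.3 > "$tmp/file2"

result=$(awk -v other="$tmp/file2" '
  { print }                                   # current line of file1
  (getline line < other) > 0 { print line }   # next line of file2, if one was read
  END {                                       # file1 is done: drain the rest of file2
    while ((getline line < other) > 0) print line
  }
' "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Reading into a separate variable (line) instead of $0 keeps the current file1 record intact, which matters if more processing is added after the getline.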
@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 |sort -t. -k 2.1
Here it's specified that the separator is "." and that we are sorting on the first character of the second field.

Extracting unique values between 2 files with awk

I need to get the unique lines when comparing two files. The files contain the field separator ":", which should be treated as the end of the line when comparing strings.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the strings before the : should be compared.
Is it possible to complete this task in one command?
I tried the following, but the field separator does not seem to take effect:
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
One comment: awk is very inefficient with arrays. In real life, with big files, it may be better to use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2
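A grep-based variant can also be written with patterns anchored to the key. This is a minimal sketch under stated assumptions: the keys contain no regular-expression metacharacters (they would need escaping otherwise), and the scratch directory and the patterns file name are illustrative. No claim is made here about its speed relative to awk.

```shell
# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' apple:tasty apple:red orange:nice kiwi:awesome kiwi:expensive \
  banana:big grape:green orange:oval banana:long > "$tmp/file1"
printf '%s\n' orange:nice banana:long > "$tmp/file2"

# Turn each distinct key of file2 into an anchored pattern such as ^orange:
cut -d: -f1 "$tmp/file2" | sort -u | sed 's/.*/^&:/' > "$tmp/patterns"

# Keep the lines of file1 that match none of those patterns.
result=$(grep -v -f "$tmp/patterns" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Anchoring each pattern with ^ and a trailing : means a key only matches as the complete first field, not as a substring elsewhere in the line.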

bash: using 2 variables from same file and sed

I have two files:
file1.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079
rs111285978:45000103:A:AT 45000103
rs190363568:45000168:C:T 45000168
file2.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T rs142159069
rs111285978:45000103:A:AT rs111285978
rs190363568:45000168:C:T rs190363568
Using file2.txt, I want to replace the names (column 1 of file1.txt, which is also column 1 of file2.txt) with the entries in column 2 of file2.txt. The output file would then be:
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
I have tried inputting the columns of file2.txt, but without success:
while read -r a b
do
cat file1.txt | sed s'/$a/$b/'
done < file2.txt
I am quite new to bash. Also, not sure how to write an output file with my command. Any help would be deeply appreciated.
In your case, using awk or perl would be easier, if you are willing to accept an answer without sed:
awk '(NR==FNR){out[$1]=$2;next}{out[$1]=out[$1]" "$2}END{for (i in out){print out[i]} }' file2.txt file1.txt > output.txt
output.txt :
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
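Note: this assumes all symbols in column 1 are unique and that they are all present in both files. Because for (i in out) visits keys in an unspecified order, the output lines may not appear in the same order as the input.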
explanation:
(NR==FNR){out[$1]=$2;next} : while you are parsing the first file, create a map with the name from the first column as key
{out[$1]=out[$1]" "$2} : append the value from the second column
END{for (i in out){print out[i]} } : print all the values in the map
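The same lookup-table idea can be arranged so that lines come out in file1.txt order, by printing during the second pass instead of in END. This is a minimal sketch, assuming that lines of file1.txt whose identifier is missing from file2.txt should simply be dropped; the array name name and the scratch directory are illustrative.

```shell
# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' \
  'rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079' \
  'rs111285978:45000103:A:AT 45000103' \
  'rs190363568:45000168:C:T 45000168' > "$tmp/file1.txt"
printf '%s\n' \
  'rs142159069:45000079:TACTTCTTGGACATTTCC:T rs142159069' \
  'rs111285978:45000103:A:AT rs111285978' \
  'rs190363568:45000168:C:T rs190363568' > "$tmp/file2.txt"

result=$(awk '
  NR == FNR { name[$1] = $2; next }    # file2: long identifier -> short name
  $1 in name { print name[$1], $2 }    # file1: emit short name and position
' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

To write the result to a file instead of the terminal, redirect the awk command with > output.txt as in the answer above.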
Apparently $2 of file2 is part of $1 of file1, so you could use awk and redefine FS:
$ awk -F"[: ]" '{print $1,$NF}' file1
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168

Join two files based on a column when there is no one-to-one correspondence, in a bash script (awk, grep, sed)

file1.txt
112|9305|/inst.exe
112|9305|/lkj.exe
112|9305|/dje.jar
112|9305|/ind.pdf
112|9306|/ma.exe
112|9306|/ngg.pdf
112|9307|/jhhh.dat
112|9312|/ee.dat
112|9312|/qwq.dll
file2.txt
117|9305|www.gahan.com
117|9306|www.google.com
117|9312|www.mihan.com
117|9307|translate.com
expected output
112|9305|www.gahan.com/inst.exe
112|9305|www.gahan.com/lkj.exe
112|9305|www.gahan.com/dje.jar
112|9305|www.gahan.com/ind.pdf
112|9306|www.google.com/ma.exe
112|9306|www.google.com/ngg.pdf
112|9307|translate.com/jhhh.dat
112|9312|www.mihan.com/ee.dat
112|9312|www.mihan.com/qwq.dll
I want to prepend the third column of file2.txt to the third column of file1.txt, based on the values in the second column. In effect I want to join the files on the second column, but there is no one-to-one correspondence between them. How can I do this with awk, grep, or sed in a shell script?
You can use awk like this:
awk 'BEGIN{FS=OFS="|"} FNR==NR{a[$2]=$3; next} $2 in a{$3=a[$2] $3} 1' file2.txt file1.txt
112|9305|www.gahan.com/inst.exe
112|9305|www.gahan.com/lkj.exe
112|9305|www.gahan.com/dje.jar
112|9305|www.gahan.com/ind.pdf
112|9306|www.google.com/ma.exe
112|9306|www.google.com/ngg.pdf
112|9307|translate.com/jhhh.dat
112|9312|www.mihan.com/ee.dat
112|9312|www.mihan.com/qwq.dll

(shell) How to remove strings from one file which can be found in another file?

file1.txt
aaaa
bbbb
cccc
dddd
eeee
file2.txt
DDDD
cccc
aaaa
result
bbbb
eeee
If it could be case-insensitive, that would be even better!
Thank you!
grep can read patterns from a file and print all lines that do NOT match any of them. It can also match case-insensitively, like this:
grep -vi -f file2.txt file1.txt
Excerpts from the man pages:
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files. (-i is
specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is
specified by POSIX.)
Off the top of my head, use grep -Fiv -f file2.txt < file1.txt.
-F no regexps (fast)
-i case-insensitive
-v invert results
-f <pattern file> get patterns from file
$ grep -iv -f file2 file1
bbbb
eeee
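The grep commands above treat each line of file2 as a pattern that may match anywhere within a line, so a pattern such as aaa would also remove aaaa. A minimal sketch, assuming whole-line matches are what is wanted: add -x, which requires a pattern to match the entire line. The scratch directory is illustrative.

```shell
# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' aaaa bbbb cccc dddd eeee > "$tmp/file1.txt"
printf '%s\n' DDDD cccc aaaa > "$tmp/file2.txt"

# -F fixed strings, -x whole-line match, -i ignore case,
# -v invert the match, -f read patterns from a file
result=$(grep -F -x -i -v -f "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With -F the lines of file2.txt are taken literally, so characters such as . or * in them are not interpreted as regular-expression syntax.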
or you can use awk
awk 'FNR==NR{ a[tolower($1)]=$1; next }
{
s=tolower($1)
f=0
for(i in a){if(i==s){f=1}}
if(!f) { print s }
} ' file2 file1
ghostdog74's awk example can be simplified:
awk '
FNR == NR { omit[tolower($0)]++; next }
tolower($0) in omit {next}
{print}
' file2 file1
For various set operations on files see:
http://www.pixelbeat.org/cmdline.html#sets
In your case the inputs are not sorted, so
you want the difference like:
sort -f file1 file2 file2 | uniq -iu
