Awk output unmatched rows - bash

I have an awk command that outputs matched rows by comparing two columns, but I would like to output the opposite: the unmatched data.
#file1.csv
box1,apple
box2,banana
#file2.csv
data24,box1,apple,text
date25,box1,banana,text
With awk I have:
awk -F',' 'NR==FNR{a[$1,$2]; next} ($2,$3) in a' file1.csv file2.csv
The output looks like:
data24,box1,apple,text
And I would like to have:
banana,box2
Simple negation does not seem to work in this case; do you have any ideas, please?
I have tried:
awk -F',' 'NR==FNR{a[$1,$2]=1; next} !($2,$1) in a' file1.csv file2.csv
Which will output:
data24,box1,apple,text
date25,box1,banana,text

One option is to swap the file roles: hash the ($2,$3) pairs from file2.csv, then print the file1.csv rows that have no match, with the fields reversed:
$ awk 'BEGIN{FS=OFS=","} NR==FNR{a[$2,$3]; next} !(($1,$2) in a){print $2, $1}' file2.csv file1.csv
banana,box2

Instead of reversing the logic, you can reverse the action:
awk -F',' 'NR==FNR{a[$1,$2]; next} ($2,$3) in a {next}1' file1.csv file2.csv
date25,box1,banana,text
Or:
awk 'BEGIN{FS=OFS=","}
NR==FNR{a[$1,$2]; next} ($2,$3) in a {next} {print $2, $3}' file1.csv file2.csv
box1,banana
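A minimal end-to-end sketch of the reverse-action approach, with the file names and contents taken from the question:

```shell
# Recreate the sample files from the question.
cat > file1.csv <<'EOF'
box1,apple
box2,banana
EOF
cat > file2.csv <<'EOF'
data24,box1,apple,text
date25,box1,banana,text
EOF

# Skip file2 rows whose ($2,$3) pair appears in file1; print the rest.
# The trailing 1 is the always-true pattern with the default print action.
awk -F',' 'NR==FNR{a[$1,$2]; next} ($2,$3) in a {next}1' file1.csv file2.csv
```

This prints `date25,box1,banana,text`, the only file2.csv row whose (box, fruit) pair is not listed in file1.csv.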

Assumptions:
- both files are comma-delimited
- find the lines from file1.csv that have no match in fields #2 and #3 of file2.csv
- for such entries (from file1.csv), print the fields in reverse order (the OP's expected output banana,box2 is the reverse of box2,banana as listed in file1.csv)
A few modifications to OP's awk code:
awk '
BEGIN   { FS=OFS="," }
NR==FNR { a[$1 FS $2]; next }       # use FS as the delimiter so the index can be split later
        { delete a[$2 FS $3] }      # delete file #1 entries if found in file #2
END     { for (i in a) {            # anything left in the array was not found in file #2
              split(i,arr,FS)       # split the index on the delimiter
              print arr[2],arr[1]   # print the fields in reverse order
          }
        }
' file1.csv file2.csv
This generates:
banana,box2


intersecting to files by several columns using awk

I have two big CSV files that look like below:
f1.csv
f1_c1,f1_c2
A,B
C,A
B,D
f2.csv
f2_c1,f2_c2,f2_c3
chr1,fail,A
chr1,pass,B
chr1,neutral,C
chr2,fail,D
I want to intersect the two files so that, for each row of f1, the matching values from the first and second columns of f2 are appended as separate columns. The desired output is:
f1_c1,f1_c2,f2_c1,f2_c1,f2_c2,f2_c2
A,B,chr1,chr1,fail,pass
C,A,chr1,chr1,neutral,fail
B,D,chr1,chr2,pass,fail
I have been trying to make the code below work, but it gives errors; I would appreciate some help fixing it.
awk 'BEGIN{FS=OFS=","}NR==FNR{gene[$3]=$1; type{$3]=$2; next}{ print ($1, $2, gene[$1], gene[$2], type[$1], type[$2] ) }' f2.csv f1.csv
Thank you.
You may use this awk:
awk 'BEGIN{FS=OFS=","} NR==1{print "f1_c1,f1_c2,f2_c1,f2_c1,f2_c2,f2_c2"} FNR==NR {m1[$3]=$1; m2[$3]=$2; next} FNR>1 {print $0, m1[$1], m1[$2], m2[$1], m2[$2]}' f2.csv f1.csv
f1_c1,f1_c2,f2_c1,f2_c1,f2_c2,f2_c2
A,B,chr1,chr1,fail,pass
C,A,chr1,chr1,neutral,fail
B,D,chr1,chr2,pass,fail
Expanded command:
awk '
BEGIN { FS = OFS = "," }
NR == 1 {
    print "f1_c1,f1_c2,f2_c1,f2_c1,f2_c2,f2_c2"
}
FNR == NR {
    m1[$3] = $1
    m2[$3] = $2
    next
}
FNR > 1 {
    print $0, m1[$1], m1[$2], m2[$1], m2[$2]
}' f2.csv f1.csv
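To sanity-check the lookup logic, here is a runnable sketch with the sample files from the question:

```shell
# Sample files as posted in the question.
cat > f1.csv <<'EOF'
f1_c1,f1_c2
A,B
C,A
B,D
EOF
cat > f2.csv <<'EOF'
f2_c1,f2_c2,f2_c3
chr1,fail,A
chr1,pass,B
chr1,neutral,C
chr2,fail,D
EOF

# First pass hashes f2 by its third column (m1: col1, m2: col2);
# second pass looks up both fields of each f1 data row.
awk 'BEGIN{FS=OFS=","} NR==1{print "f1_c1,f1_c2,f2_c1,f2_c1,f2_c2,f2_c2"}
FNR==NR {m1[$3]=$1; m2[$3]=$2; next}
FNR>1 {print $0, m1[$1], m1[$2], m2[$1], m2[$2]}' f2.csv f1.csv
```

The output matches the desired table, header included.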

awk match substring in column from 2 files

I have the following two files (real data is tab-delimited instead of semicolon):
input.txt
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
db.txt updated
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
So, column 1 of input.txt is a substring of column 2 of db.txt. Only the first two "fields" separated by | matter here.
I want to use awk to match these two columns and print the following (again in tab-delimited form):
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
This is my code:
awk -F'[\t]' 'NR==FNR{a[$1]=$1}$1 in a {print $0"\t"$1}' input.txt db.txt
EDIT
Column 2 of db.txt can contain several space-delimited strings, and column 1 of input.txt matches one of them. There are many more strings in the real data than shown in this short excerpt.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
EDIT:
As per your comment you can use:
awk 'BEGIN{FS=OFS="\t"} NR==FNR {
a[$2]=$1; next} {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
Going with the semicolons shown here; you can substitute tabs:
$ awk -F\; '
NR==FNR {                        # hash the db file
    a[$2]=$1
    next
}
{
    for(i in a)                  # for each record in the input file
        if($1~i) {               # see if $1 matches a key in a
            print $0 ";" a[i]    # output
            # delete a[i]        # delete the entry from a for speed (if possible)
            break                # on match, break from the for loop for speed
        }
}' db input                      # note the file order
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
For each record in input, the script matches $1 against every entry in db, so it is slow. You can speed it up by adding a break to the if, and by deleting each matched entry from a (if your data allows it).
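A runnable sketch of the semicolon variant, using index() rather than a regex match so that the | characters in the keys are treated literally (file names as in the answer above):

```shell
# Sample files from the question (semicolon-delimited variant).
cat > db <<'EOF'
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
EOF
cat > input <<'EOF'
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
EOF

# Hash db by column 2; for each input row, find the db key that
# contains $1 as a literal substring, then delete it and break.
awk -F';' 'NR==FNR{a[$2]=$1; next}
  {for (i in a) if (index(i, $1)) {print $0 ";" a[i]; delete a[i]; break}}' db input
```

index() avoids the surprise that a regex match would interpret | as alternation.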

find unique lines based on one field only [duplicate]

I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing the other, duplicate occurrences.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
This should also give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv
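Note that for (i in a) visits keys in unspecified order, so the store-then-END style may shuffle the lines (the `!seen[$1]++` one-liner above does not, since it prints immediately). A small sketch that keeps input order while still using an END block:

```shell
# Sample input from the question.
cat > Input.csv <<'EOF'
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
EOF

# Record each new key's position; replay in that order at the end.
awk -F, '!($1 in a){order[++n]=$1; a[$1]=$0}
         END{for (j=1; j<=n; j++) print a[order[j]]}' Input.csv
```

This prints the three kept lines in their original order: 10, 20, then 40.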

Split file on the value of a certain column into separate files and also include the header

fullfile.csv:
animal,number
rabbit,1
fish,2
mouse,1
dog,1
lizard,2
cat,2
I want to split the file on the value in the second column, so I used this command:
awk 'BEGIN {FS = ","}; {print > ("file"$2".csv")}' fullfile.csv
Outputs:
file1.csv
rabbit,1
mouse,1
dog,1
file2.csv
fish,2
lizard,2
cat,2
However, there is no header in file1.csv or file2.csv, so I tried to add it like this:
awk 'BEGIN {FS = ","}; NR==1 { print } {print > ("file"$2".csv")}' fullfile.csv
But the header prints to the command line instead of going to each file. How do I get the header to be included in each file?
You can also specify the field separator outside of the awk script with awk -F",".
You could store the header as a variable when NR==1, keep the seen file numbers in an array, and write the header only when a number is not yet in the array. Once the number is in the array, lines are simply appended to their respective files, as in your original command:
awk -F"," 'NR==1{header=$0}NR>1&&!a[$2]++{print header > ("file"$2".csv")}NR>1{print > ("file"$2".csv")}' fullfile.csv
Output:
file1.csv
animal,number
rabbit,1
mouse,1
dog,1
file2.csv
animal,number
fish,2
lizard,2
cat,2
Here is a simpler awk command with better formatting.
awk -F, '
NR==1 {hdr=$0; next}
{fn="file" $2 ".csv"}
!seen[$2]++{print hdr > fn}
{print > fn}' fullfile.csv
Sample output
$ for i in file*.csv; do echo $i; cat $i; echo; done
file1.csv
animal,number
rabbit,1
mouse,1
dog,1
file2.csv
animal,number
fish,2
lizard,2
cat,2
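One caveat not covered above: when the split column has many distinct values, some awk implementations run out of open file descriptors. A sketch (under that assumption) that closes each file after every write; the first write per file uses > to truncate any stale file, later writes append:

```shell
# Sample input from the question.
cat > fullfile.csv <<'EOF'
animal,number
rabbit,1
fish,2
mouse,1
dog,1
lizard,2
cat,2
EOF

awk -F, '
NR==1 {hdr=$0; next}
{fn="file" $2 ".csv"}
!seen[$2]++ {print hdr > fn; close(fn)}   # truncate + header, once per file
{print >> fn; close(fn)}                  # append one row, then release the fd
' fullfile.csv
```

Closing after every line costs some reopen overhead, so it is only worth doing when the number of output files is large.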

Values missing in awk

My Input files :
file1
231|35000
234|15000
242|60000
254|12313
345|50000
435|24300
file2
1|madhan|retl|231|tcs
2|vaisakh|retl|234|tcs
4|sam|ins|242|infy
5|tina|bfs|254|tcs
3|ram|bfs|345|infy
6|subbu|bfs|435|infy
Desired output:
col1 and col2 of file1, plus col2 of file2, joined on the common column (col1 of file1 = col4 of file2).
My code :
awk 'BEGIN { FS="|";} NR==FNR{a[$1] = $2;next} ($4 in a) {print $2 "|" $4 "|" a[$1]} ' file_1 file_2
The output I got:
madhan|231|
vaisakh|234|
sam|242|
tina|254|
ram|345|
subbu|435|
Can you help me see why the last column comes out blank?
Try something like:
join -t '|' -1 1 -2 4 file1 file2 | awk -F'|' '{print $1 "|" $2 "|" $4}'
Join on field 1 of file1 and field 4 of file2, then extract the fields you need using awk.
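One thing to keep in mind: join requires both inputs sorted on the join fields. A runnable sketch that sorts explicitly first (the sample data happens to be sorted already, but real data may not be):

```shell
# Sample files from the question.
cat > file1 <<'EOF'
231|35000
234|15000
242|60000
254|12313
345|50000
435|24300
EOF
cat > file2 <<'EOF'
1|madhan|retl|231|tcs
2|vaisakh|retl|234|tcs
4|sam|ins|242|infy
5|tina|bfs|254|tcs
3|ram|bfs|345|infy
6|subbu|bfs|435|infy
EOF

# join emits: join field, remaining fields of file1, remaining fields of file2,
# so e.g. 231|35000|1|madhan|retl|tcs -- hence $1, $2, $4 below.
sort -t '|' -k1,1 file1 > file1.sorted
sort -t '|' -k4,4 file2 > file2.sorted
join -t '|' -1 1 -2 4 file1.sorted file2.sorted | awk -F'|' '{print $1 "|" $2 "|" $4}'
```

Sorting into temporary files keeps the sketch POSIX sh compatible; in bash you could use process substitution instead.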
This should do:
awk -F\| 'FNR==NR {a[$1]=$0;next} {for (i in a) if (i==$4) print a[i]"|"$2}' file1 file2
231|35000|madhan
234|15000|vaisakh
242|60000|sam
254|12313|tina
345|50000|ram
435|24300|subbu
It stores file1 in array a, using the first field as the index.
Then it tests each index from the first file against the fourth field of file2.
If they are equal, it prints the data from file1 plus the second field from file2.
It comes up blank because the key you look up does not exist in the array: you stored the first column of file1 as the key, and that value appears in the 4th column of file2, so the lookup should be a[$4] rather than a[$1].
$ awk '
BEGIN { FS=OFS="|" }
NR==FNR { a[$1]=$2; next }
($4 in a) { print $2, $4, a[$4] }
' file1 file2
madhan|231|35000
vaisakh|234|15000
sam|242|60000
tina|254|12313
ram|345|50000
subbu|435|24300
If you need the order shown in your requested output, then:
$ awk 'BEGIN {FS=OFS="|"}NR==FNR{a[$4]=$2;next} ($1 in a) {print $0, a[$1]}' file2 file1
231|35000|madhan
234|15000|vaisakh
242|60000|sam
254|12313|tina
345|50000|ram
435|24300|subbu
