Compare multiple Columns and Append the result into another file - shell

I have two files file1 and file2, Both the files have 5 columns.
I want to compare first 4 columns of file1 with file2.
If they are equal, need to compare the 5th column. If 5th column values are different, need to print the file1's 5th column as file2's 6th column.
I have used below awk to compare two columns in two different files, but how to compare multiple columns and append the particular column in another file if matches found?
awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]'
file1:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,topcat
132,item2,grade3,wing7,middlecat
134,item2,grade3,wing7,middlecat
177,item8,gradeA,wing11,lowcat
file2:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,lowcat
132,item3,grade3,wing7,middlecat
126,item2,grade3,wing7,maingroup
177,item8,gradeA,wing11,lowcat
Desired output:
123,item3,grade5,wing10,lowcat,topcat

Awk can simulate multidimensional arrays by sequencing the indices. Underneath the indices are concatenated using the built-in SUBSEP variable as a separator:
$ awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next } a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat
awk -F, -v OFS=,
Set both input and output separators to ,
NR==FNR { a[$1,$2,$3,$4]=$5; next }
Create an associative array from the first file relating the first four fields of each line to the
fifth. When using a comma-separated list of values as an index, awk actually concatenates them
using the value of the built-in SUBSEP variable as a separator. This is awk's way of
simulating multidimensional arrays with a single subscript. You can set SUBSEP to any value you like
but the default, which is a non-printing character unlikely to appear in the data, is usually
fine. (You can also just do the trick yourself, something like a[$1 "|" $2 "|" $3 "|" $4],
assuming you know that your data contains no vertical bars.)
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }
Arriving here, we know we are looking at the second file. If the first four fields were found in the
first file, and the $5 from the first file is different than the $5 in the second, print the line
from the second file followed by the $5 from the first. (I am assuming here that no $5 from the first file will have a value that evaluates to false, such as 0 or empty.)

$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $0; sub("(,[^,]*){"NF-4"}$","",key) }
NR==FNR { file1[key] = $5; next }
(key in file1) && ($5 != file1[key]) {
print $0, file1[key]
}
$ awk -f tst.awk file1 file2
123,item3,grade5,wing10,lowcat,topcat

Related

awk match substring in column from 2 files

I have the following two files (real data is tab-delimited instead of semicolon):
input.txt
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
db.txt updated
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
So, column1 of input.txtis a substring of column2 of db.txt. Only two "fields" separated by | is important here.
I want to use awk to match these two columns and print the following (again in tab-delimited form):
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
This is my code:
awk -F'[\t]' 'NR==FNR{a[$1]=$1}$1 in a {print $0"\t"$1}' input.txt db.txt
EDIT
column2 of db.txt contains strings of column1 of input.txt, delimited by a space. There are many more strings in the real example than shown in the short excerpt.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
EDIT:
As per your comment you can use:
awk 'BEGIN{FS=OFS="\t"} NR==FNR {
a[$2]=$1; next} {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
Going with the semicolons, you can replace with the tabs:
$ awk -F\; '
NR==FNR { # hash the db file
a[$2]=$1
next
}
{
for(i in a) # for each record in input file
if($1~i) { # see if $1 matches a key in a
print $0 ";" a[i] # output
# delete a[i] # delete entry from a for speed (if possible?)
break # on match, break from for loop for speed
}
}' db input # order order
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
For each record in input script matches the $1 against every entry in db, so it's slow. You can speed it up by adding a break to the if and deleteing matching entry from a (if your data allows it).

cut a field from its position & place it in different position

I have 2 files - file1 & file2 with contents as shown.
cat file1.txt
1,2,3
cat file2.txt
a,b,c
& the desired output is as below,
a,1,b,2,c,3
Can anyone please help to achieve this?
Till now i have tried this,
paste -d "," file1.txt file2.txt|cut -d , -f4,1,5,2,6,3
& the output came as 1,2,3,a,b,c
But using 'cut' is not the good approach i think.
Becuase here i know there are 3 values in both files, but if the values are more, above command will not be helpful.
try:
awk -F, 'FNR==NR{for(i=1;i<=NF;i++){a[FNR,i]=$i};next} {printf("%s,%s",a[FNR,1],$1);for(i=2;i<=NF;i++){printf(",%s,%s",a[FNR,i],$i)};print ""}' file2.txt file1.txt
OR(a NON-one liner form of solution too as follows)
awk -F, 'FNR==NR{ ####making field separator as , then putting FNR==NR condition will be TRUE when first file named file1.txt will be read by awk.
for(i=1;i<=NF;i++){ ####Starting a for loop here which will run till the total number of fields value from i=1.
a[FNR,i]=$i ####creating an array with name a whose index is FNR,i and whose value is $i(fields value).
};
next ####next will skip all further statements, so that second file named file2.txt will NOT read until file1.txt is completed.
}
{
printf("%s,%s",a[FNR,1],$1); ####printing the value of a very first element of each lines first field here with current files first field.
for(i=2;i<=NF;i++){ ####starting a for loop here till the value of NF(number of fields).
printf(",%s,%s",a[FNR,i],$i) ####printing the values of array a value whose index is FNR and variable i and printing the $i value too here.
};
print "" ####printing a new line here.
}
' file2.txt file1.txt ####Mentioning the Input_files here.
paste -d "," file*|awk -F, '{print $4","$1","$5","$2","$6","$3}'
a,1,b,2,c,3
This is simple printing operation. Other answers are most welcome.
But if the file contains 1000's of values, then this printing approach will not help.
$ awk '
BEGIN { FS=OFS="," }
NR==FNR { split($0,a); next }
{
for (i=1;i<=NF;i++) {
printf "%s%s%s%s", $i, OFS, a[i], (i<NF?OFS:ORS)
}
}
' file1 file2
a,1,b,2,c,3
or if you prefer:
$ paste -d, file2 file1 |
awk '
BEGIN { FS=OFS="," }
{
n=NF/2
for (i=1;i<=n;i++) {
printf "%s%s%s%s", $i, OFS, $(i+n), (i<n?OFS:ORS)
}
}
'
a,1,b,2,c,3

Compare two csv files, use the first three columns as identifier, then print common lines

I have two csv files. File 1 has 9861 rows and 4 columns while File 2 has 6037 rows and 5 columns.Here are the files.
Link of File 1
Link of File 2
The first three columns are years, months, days respectively.
I want to get the lines in File 2 with the same identifier in File 1 and print this to File 3.
I found this command from some posts here but this only works using one column as identifier:
awk -F, 'NR==FNR {a[$1]=$0;next}; $1 in a {print a[$1]; print}' file1 file2
Is there a way to do this using awk or any simpler commands where I can use the first three columns as identifier?
Ill appreciate any help.
Just use more columns to make the uniqueness you need:
$ awk -F, 'NR==FNR {a[$1, $2, $3] = $0; next}
$1 SUBSEP $2 SUBSEP $3 in a' file1 file2
SUBSEP
is the subscript separator. It has the default value of "\034", and is used to separate the parts of the indices of a multi-dimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"]
awk -F, '{k=$1 FS $2 FS $3} NR==FNR{a[k];next} k in a' file1 file2
Untested of course since you didn't provide any sample input/output.

Compare two columns of different files and add new column if it matches

I would like to compare the first two columns of two files, if matched need to print yes else no.
input.txt
123,apple,type1
123,apple,type2
456,orange,type1
6567,kiwi,type2
333,banana,type1
123,apple,type2
qualified.txt
123,apple,type4
6567,kiwi,type2
output.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
I was using the below command for split the data, and then i will add one more column based on the result.
Now the the input.txt has duplicate(1st column) so the below method is not working, also the file size was huge.
Can we get the output.txt in awk one liner?
comm -2 -3 input.txt qualified.txt
$ awk -F, 'NR==FNR {a[$1 FS $2];next} {print $0 FS (($1 FS $2) in a?"yes":"no")}' qual input
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
Explained:
NR==FNR { # for the first file
a[$1 FS $2];next # aknowledge the existance of qualified 1st and 2nd field pairs
}
{
print $0 FS ($1 FS $2 in a?"yes":"no") # output input row and "yes" or "no"
} # depending on whether key found in array a
No need to redefine the OFS as $0 isn't modified and doesn't get rebuilt.
You can use awk logic for this as below. Not sure why do you mention one-liner awk command though.
awk -v FS="," -v OFS="," 'FNR==NR{map[$1]=$2;next} {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
The logic is
The command FNR==NR parses the first file qualified.txt and stores the entries in column 1 and 2 in first file with first column being the index.
Then for each of the line in 2nd file {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1 the entry in column 1 does not match the array, append the no string and yes otherwise.
-v FS="," -v OFS="," are for setting input and output field separators
It looks like all you need is:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next} {print $0, ($1 in a ? "yes" : "no")}' qualified.txt output.txt

Compare strings using awk command

I am at initial stage of learning shell scripting. So please explain me the steps for better understanding.
Consider I have two files
Content of the two files are as below:
File1.txt
ABC=10
DEF=20
XYZ=30
File2.txt
DEF=15
XYZ=20
I want to write a simple shell script to check both the files and add the values and print the final output as below. like
ABC=10
DEF=35
XYZ=50
You can use awk:
awk 'BEGIN{FS=OFS="="} FNR==NR{a[$1]=$2;next} {a[$1]+=$2}
END{for (i in a) print i, a[i]}' file1 file2
ABC=10
XYZ=50
DEF=35
Breakup:
NR == FNR { # While processing the first file
a[$1] = $2 # store the second field by the first in an array
next # move to next record
}
{ # while processing the second file
a[$1]+=$2 # add already stored value by 2nd field in 2nd file
}
END{..} # iterate the array and print the values
If you want to keep original ordering intact then use:
awk 'BEGIN{FS=OFS="="} FNR==NR{if (!($1 in a)) b[++n]=$1; a[$1]=$2;next} {a[$1]+=$2}
END{for (i=1; i<=n; i++) print b[i], a[b[i]]}' file1 file2
ABC=10
DEF=35
XYZ=50

Resources