Comparing 2 columns from different files and printing matching columns - bash

I know similar questions have been asked, which led me to write the code below, but I am still not able to get the correct output.
Question:
If Column 1 (in file 1) matches Column 5 (in file 2), print all columns in file 2 and columns 3 and 4 (in file 1) to a new file.
File 1 (tab-delimited)
NJE_00001 rmf 6.2 Ribosome modulation factor
NJE_00002 rlm 7.1 Ribosomal RNA large subunit methyltransferase
NJE_00003 gnt 6.2 putative D-xylose utilization operon
NJE_00004 prp 4.1 2-methylisocitrate lyase
File 2 (tab-delimited)
AFC_04390 rmf 5.6 protein1 NJE_00001
AFC_04391 rlm 2.5 protein54 NJE_00002
AFC_04392 gnt 2.1 protein8 NJE_00003
AFC_04393 prp 4.1 protein5 NJE_00004
Desired Output (tab-delimited)
AFC_04390 rmf 5.6 protein1 NJE_00001 6.2 Ribosome modulation factor
AFC_04391 rlm 2.5 protein54 NJE_00002 7.1 Ribosomal RNA large subunit methyltransferase
AFC_04392 gnt 2.1 protein8 NJE_00003 6.2 putative D-xylose utilization operon
AFC_04393 prp 4.1 protein5 NJE_00004 4.1 2-methylisocitrate lyase
What I've tried:
awk -F '\t' 'NR==FNR {a[$1]=$3"\t"$4; next} ($5 in a) {print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[$1]}' file1.tsv file2.tsv > file.out
awk -F '\t' 'NR==FNR {a[$1]=$2; next} {if ($5 in a) {print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[$1]}}' file1.tsv file2.tsv > file.out
awk -F '\t' 'NR==FNR {h[$1]=$3"\t"$4; next} ($5 in h) {print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" h[$1]}' file1.tsv file2.tsv > file.out
They've all given the same output, which is identical to file 2. Any help would be much appreciated! Thank you!
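The likely bug in all three attempts is the lookup key on the print side: the array is indexed by file 1's column 1 (the NJE_* IDs), so when reading file 2 it must be looked up with $5, not $1 — a[$1] is always empty there, which is why the output looks identical to file 2. A minimal sketch against the shown samples (file names assumed):

```shell
# Recreate two rows of each tab-delimited sample (assumption: the real
# files have the same shape)
printf 'NJE_00001\trmf\t6.2\tRibosome modulation factor\n' >  file1.tsv
printf 'NJE_00004\tprp\t4.1\t2-methylisocitrate lyase\n'   >> file1.tsv
printf 'AFC_04390\trmf\t5.6\tprotein1\tNJE_00001\n' >  file2.tsv
printf 'AFC_04393\tprp\t4.1\tprotein5\tNJE_00004\n' >> file2.tsv

# Same structure as the attempts above, but the array is read with $5
out=$(awk -F'\t' 'NR==FNR {a[$1]=$3 "\t" $4; next}
                  ($5 in a) {print $0 "\t" a[$5]}' file1.tsv file2.tsv)
echo "$out"
```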

$ awk '
BEGIN { FS=OFS="\t" }
NR==FNR { a[$1]=$3 OFS $4; next }
{ print $0, a[$5] }
' file1 file2
AFC_04390 rmf 5.6 protein1 NJE_00001 6.2 Ribosome modulation factor
AFC_04391 rlm 2.5 protein54 NJE_00002 7.1 Ribosomal RNA large subunit methyltransferase
AFC_04392 gnt 2.1 protein8 NJE_00003 6.2 putative D-xylose utilization operon
AFC_04393 prp 4.1 protein5 NJE_00004 4.1 2-methylisocitrate lyase

Could you please try the following.
awk '
FNR==NR{
  val=$1
  $1=$2=""
  sub(/^ +/,"")
  a[val]=$0
  next
}
($NF in a){
  print $0,a[$NF]
}
' Input_file1 Input_file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file1 is being read.
val=$1 ##Creating variable val which has $1 of current line.
$1=$2="" ##Nullifying first and second fields here.
sub(/^ +/,"") ##Substituting initial space with NULL in current line.
a[val]=$0 ##Creating an array named a with index val and value of current line.
next ##next will skip all further statements from here.
}
($NF in a){ ##Checking condition if $NF(last field of current line) is present in array a then do following.
print $0,a[$NF] ##Printing current line with array a with index $NF value.
}
' file1 file2 ##Mentioning Input_file names here.

As long as your files are sorted on the first and fifth columns respectively, as in your sample, it's easy to do with join:
join -t$'\t' -1 1 -2 5 -o 2.1,2.2,2.3,2.4,2.5,1.3,1.4 file1.tsv file2.tsv > joined.tsv
If not sorted, use <(sort -t$'\t' -k1,1 file1.tsv) and/or <(sort -t$'\t' -k5,5 file2.tsv) instead of just the filename as arguments to join.
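Putting that together for unsorted input, with bash process substitution (file names as above; a sketch):

```shell
# Tab-delimited samples, deliberately out of order (bash is assumed for
# the $'\t' quoting and the <(...) process substitution)
printf 'NJE_00004\tprp\t4.1\t2-methylisocitrate lyase\n'   >  file1.tsv
printf 'NJE_00001\trmf\t6.2\tRibosome modulation factor\n' >> file1.tsv
printf 'AFC_04393\tprp\t4.1\tprotein5\tNJE_00004\n' >  file2.tsv
printf 'AFC_04390\trmf\t5.6\tprotein1\tNJE_00001\n' >> file2.tsv

# Sort each file on its join field on the fly, then join
out=$(join -t$'\t' -1 1 -2 5 -o 2.1,2.2,2.3,2.4,2.5,1.3,1.4 \
      <(sort -t$'\t' -k1,1 file1.tsv) \
      <(sort -t$'\t' -k5,5 file2.tsv))
echo "$out"
```

The output comes back ordered by the join key (NJE_00001 first), not by file 2's original order.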

Related

How to compare two files and print the values of both the files which are different

There are 2 files. I need to sort them first, then compare them, and for the differences print the values from both File 1 and File 2.
file1:
pair,bid,ask
AED/MYR,3.918000,3.918000
AED/SGD,3.918000,3.918000
AUD/CAD,3.918000,3.918000
file2:
pair,bid,ask
AUD/CAD,3.918000,3.918000
AUD/CNY,3.918000,3.918000
AED/MYR,4.918000,4.918000
Output should be:
pair,inputbid,inputask,outputbid,outtputask
AED/MYR,3.918000,3.918000,4.918000,4.918000
The only difference between the 2 files is AED/MYR, with different bid/ask rates. How can I print the differing values from file 1 and file 2?
I tried using the command below:
nawk -F, 'NR==FNR{a[$1]=$4;a[$2]=$5;next} !($4 in a) || !($5 in a) {print $1 FS a[$1] FS a[$2] FS $4 FS $5}' file1 file2
Result output as below:
pair,bid,ask,bid,ask
AUD/CAD,3.918000,3.918000,3.918000,3.918000
AUD/CHF,3.918000,3.918000,3.918000,3.918000
AUD/CNH,3.918000,3.918000,3.918000,3.918000
AUD/CNY,3.918000,3.918000,3.918000,3.918000
AED/MYR,3.918000,3.918000,4.918000,4.918000
We are still not able to get only the difference.
Could you please try the following, written and tested in GNU awk with the shown samples.
awk -v header="pair,inputbid,inputask,outputbid,outtputask" '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  arr[$1]=$0
  next
}
($1 in arr) && arr[$1]!=$0{
  val=$1
  $1=""
  sub(/^,/,"")
  if(!found){
    print header
    found=1
  }
  print arr[val],$0
}' Input_file1 Input_file2
Explanation: Adding detailed explanation for above.
awk -v header="pair,inputbid,inputask,outputbid,outtputask" ' ##Starting awk program from here and setting this to header value here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting field separator and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file1 is being read.
arr[$1]=$0 ##Creating arr with index $1 and keep value as current line.
next ##next will skip all further statements from here.
}
($1 in arr) && arr[$1]!=$0{ ##Checking condition if first field is present in arr and its value NOT equal to $0
val=$1 ##Creating val which holds the first field of the current line.
$1="" ##Nullifying first field here.
sub(/^,/,"") ##Substitute starting , with NULL here.
if(!found){ ##Checking if found is NULL then do following.
print header ##Printing header here only once.
found=1 ##Setting found here.
}
print arr[val],$0 ##Printing arr with index of val and current line here.
}' Input_file1 Input_file2 ##Mentioning Input_files here.
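Run against the shown samples, the script prints the header once followed by only the changed row — a self-contained check (file names f1 and f2 are placeholders; the header spelling is kept as in the question):

```shell
cat > f1 <<'EOF'
pair,bid,ask
AED/MYR,3.918000,3.918000
AED/SGD,3.918000,3.918000
AUD/CAD,3.918000,3.918000
EOF
cat > f2 <<'EOF'
pair,bid,ask
AUD/CAD,3.918000,3.918000
AUD/CNY,3.918000,3.918000
AED/MYR,4.918000,4.918000
EOF

# Same program as above: remember each file-1 line by key, then print
# only file-2 lines whose key exists with a different full line
out=$(awk -v header="pair,inputbid,inputask,outputbid,outtputask" '
BEGIN{FS=OFS=","}
FNR==NR{arr[$1]=$0; next}
($1 in arr) && arr[$1]!=$0{
  val=$1; $1=""; sub(/^,/,"")
  if(!found){print header; found=1}
  print arr[val], $0
}' f1 f2)
echo "$out"
```

The identical header rows and the identical AUD/CAD rows compare equal, so they never print.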
With bash process substitution, then join and then choosing with awk:
# print header
printf "%s\n" "pair,inputbid,inputask,outputbid,outtputask"
# remove first line from both files, then sort them on first field
# then join them on first field and output first 5 fields
join -t, -1 1 -2 1 -o1.1,1.2,1.3,2.2,2.3 <(tail -n +2 file1 | sort -t, -k1,1) <(tail -n +2 file2 | sort -t, -k1,1) |
# output only those lines, that columns differ
awk -F, '$2 != $4 || $3 != $5'

How to get percentage of a column based on key in unix

I have table like
Student_name,Subject
Ram,Maths
Ram,Science
Arjun,Maths
Arjun,Science
Arjun,Social
Arjun,Social
Output: I need to report only the student whose 'Social' subject percentage is more than 49%.
Final output
Arjun, social, 50
Temp output(backend)
Student_name,Subject,Percentage(group by student name)
Ram,Maths,50
Ram,Science,50
Arjun,Maths,25
Arjun,Science,25
Arjun,Social,50
I have tried the below awk commands, but I get percentages over all subjects irrespective of the grouping by student name.
awk -F, '{x++;}{a[$1,$2]++;}END{for (i in a)print i, a[i],(a[i]/x)*100;}' OFS=, test1.csv > output2.dat
awk -F, '$2=="Science" && $3>=49{ print $1}' output2.dat
And can we get it in a single awk command?
Try the following awk too; it will provide the output in the same order as the data in the Input_file.
awk 'FNR>1 && FNR==NR{a[$1]++;b[$1]=$0;next} FNR==1 && FNR!= NR{print $0,"percentage";next}($1 in b){print $0"\t"100/a[$1]"%"}' Input_file Input_file
EDIT: Adding non-one liner form of solution too now.
awk '
FNR>1 && FNR==NR{
  a[$1]++;
  b[$1]=$0;
  next
}
FNR==1 && FNR!=NR{
  print $0,"percentage";
  next
}
($1 in b){
  print $0"\t"100/a[$1]"%"
}
' Input_file Input_file
EDIT1: Adding new solution as per OP's change in requirement.
awk '
FNR>1 && FNR==NR{
  a[$1]++;
  b[$1]=b[$1]?b[$1] ORS $0:$0;
  c[$1,$2];
  next
}
FNR==1 && FNR!=NR{
  print $0,"percentage";
  next
}
($1 in b){
  if($2=="Science" && (100/a[$1])>49){
    print b[$1]
  }
}
' Input_file Input_file
GNU awk solution:
awk -F, 'NR==1{ print $0,"Percentage" }NR>1{ a[$1][$2]++ }
END{
for(i in a) for(j in a[i]) print i,j,(a[i][j]/length(a[i])*100"%")
}' OFS=',' test1.csv | column -t
The output:
Student_name,Subject,Percentage
Ram,Maths,50%
Ram,Science,50%
Arjun,Social,66.6667%
Arjun,Maths,33.3333%
Arjun,Science,33.3333%
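Note that length(a[i]) counts distinct subjects per student, which is why Arjun's Social shows 66.6667% here rather than the 50% in the question's temp output (Arjun has 4 rows but only 3 distinct subjects). To divide by the total number of rows per student instead, a portable-awk sketch:

```shell
cat > test1.csv <<'EOF'
Student_name,Subject
Ram,Maths
Ram,Science
Arjun,Maths
Arjun,Science
Arjun,Social
Arjun,Social
EOF

# tot[] counts rows per student, cnt[] rows per student+subject pair;
# the percentage is pair count over the student total
out=$(awk -F, 'NR>1{tot[$1]++; cnt[$1 FS $2]++}
END{for(k in cnt){split(k,p,FS); print p[1] FS p[2] FS cnt[k]/tot[p[1]]*100}}' test1.csv)
echo "$out"
```

Output order from `for (k in cnt)` is unspecified, but the values match the question's temp output (Arjun's Social comes out as 50).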
Use a Numeric Comparison
You can do this with a very simple numeric comparison against the third field:
$ awk '$3 > 49 {print}' /tmp/input
Student_name Subject Percentage(group by student name)
Ram Maths 50%
Ram Science 50%
For this comparison, AWK coerces the number to a string, so the comparison treats 50% the same as 50. As a nice byproduct, if the third field doesn't contain any numbers then it does a string comparison. The header line also compares greater than the string 49, so it matches, too.
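That behavior is easy to verify with a quick sketch (sample data inlined; values like 50% are not valid numbers, so awk falls back to string comparison against the string form of 49):

```shell
# "50%" > "49" as strings ('5' > '4'), "25%" < "49", and
# "Percentage" > "49" ('P' > '4'), so the header matches too
out=$(printf 'Ram Maths 50%%\nArjun Science 25%%\nheader header Percentage\n' |
      awk '$3 > 49')
echo "$out"
```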

Compare two columns of two files and display the third column along with whether it matches or not in unix

I would like to compare the first two columns of two files, file1.txt and file2.txt, and write to another file output.txt the third column of both file1 and file2, along with whether they match or not.
file1.txt
ab|2001|name1
cd|2000|name2
ef|2002|name3
gh|2003|name4
file2.txt
xy|2001|name5
cd|2000|name6
ef|2002|name7
gh|2003|name8
output.txt
name1 name5 does not match
name2 name6 matches
name3 name7 matches
name4 name8 matches
Welcome to Stack Overflow, could you please try the following and let me know if this helps you.
awk -F"|" 'FNR==NR{a[$2]=$1;b[$2]=$3;next} ($2 in a) && ($1==a[$2]){print b[$2],$3,"matched properly.";next} {print b[$2],$3,"Does NOT matched."}' file1.txt file2.txt
EDIT: Adding a non-one liner form of solution too here.
awk -F"|" '
FNR==NR{
  a[$2]=$1;
  b[$2]=$3;
  next
}
($2 in a) && ($1==a[$2]){
  print b[$2],$3,"matched properly.";
  next
}
{
  print b[$2],$3,"Does NOT matched."
}
' file1.txt file2.txt
Explanation: Adding explanation for above code.
awk -F"|" ' ##Starting awk program from here and setting field separator as | here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file1.txt is being read.
a[$2]=$1; ##Creating an array with named a whose index is $2 and value is $1.
b[$2]=$3; ##Creating an array named b whose index is $2 and value is $3.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
($2 in a) && ($1==a[$2]){ ##Checking condition if $2 is in array a AND first field is equal to value of array a value of index $2.
print b[$2],$3,"matched properly."; ##Printing array b value with index $2 and 3rd field with string value matched properly.
next ##next will skip all statements from here.
} ##Closing BLOCK for above condition here.
{
print b[$2],$3,"Does NOT matched." ##Printing value of array b with index $2 and 3rd field with string Does NOT matched here.
}
' file1.txt file2.txt ##Mentioning Input_file names here.
You can use paste and awk to get what you want.
The solution below assumes the fields in file1 and file2 will always be delimited by "|":
paste -d "|" file1.txt file2.txt | awk -F "|" '{ if( $1 == $4 && $2 == $5 ){print $3, $6, "matches"} else {print $3, $6, "does not match"} }' > output.txt
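For the shown samples this produces the desired output; note the paste approach assumes both files have the same number of lines in the same order — a quick check:

```shell
cat > file1.txt <<'EOF'
ab|2001|name1
cd|2000|name2
EOF
cat > file2.txt <<'EOF'
xy|2001|name5
cd|2000|name6
EOF

# paste glues the two rows side by side, so file1's columns are $1-$3
# and file2's are $4-$6 in the combined record
out=$(paste -d '|' file1.txt file2.txt |
      awk -F'|' '{ if ($1 == $4 && $2 == $5) print $3, $6, "matches"
                   else                      print $3, $6, "does not match" }')
echo "$out"
```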

Find in word from one file to another file using awk, NR==FNR

I am trying to map words from one file to another using the following awk command:
awk 'NR==FNR{a[$2]=$1;next}($1 in a){print a[$2] " " $2}' file1 file2
content of files are :
file1:
vidhu 1
gangwar 2
file2:
1 1
2 4980022
Expected Output:
vidhu 1
gangwar 4980022
But the output comes out like this:
vidhu 1
4980022
Help me find the problem.
#try:
awk 'FNR==NR{A[$2]=$1;next} ($1 in A){print A[$1], $2}' Input_file1 Input_file2
Explanation:
awk 'FNR==NR #### Checking condition of FNR==NR which will be TRUE when first Input_file1 is being read.
{A[$2]=$1; #### create an array named A with index of field 2 and have value as first field.
next} #### using next keyword(built-in awk) so it will skip all next statements.
($1 in A) #### Now checking if first field of file2 is present in array A, this will be checked only when Input_file2 is being read.
{print A[$1], $2 #### printing value of array A's value whose index is $1 and $2 of Input_file2.
}' Input_file1 Input_file2 #### Mentioning the Input_file1 and Input_file2 here.
You can also use something like this:
join -1 2 -2 1 file1 file2 -o 1.1,2.2
Assuming that your files are in sorted order; if not, sort them first.
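With the shown samples (file1 already sorted on field 2 and file2 on field 1), the join variant produces the expected output — a sketch:

```shell
cat > file1 <<'EOF'
vidhu 1
gangwar 2
EOF
cat > file2 <<'EOF'
1 1
2 4980022
EOF

# Join file1's field 2 against file2's field 1, emitting file1's name
# and file2's value
out=$(join -1 2 -2 1 -o 1.1,2.2 file1 file2)
echo "$out"
```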

using sed, awk, or sort for csv manipulation

I have a csv file that needs a lot of manipulation. Maybe by using awk and sed?
input:
"Sequence","Fat","Protein","Lactose","Other Solids","MUN","SCC","Batch Name"
1,4.29,3.3,4.69,5.6,11,75,"35361305a"
2,5.87,3.58,4.41,5.32,10.9,178,"35361305a"
3,4.01,3.75,4.75,5.66,12.2,35,"35361305a"
4,6.43,3.61,3.56,4.41,9.6,275,"35361305a"
final output:
43330075995647
59360178995344
40380035995748
64360275964436
I'm able to get through some of it going step by step.
How do I test specific columns for a value over 9.9 and replace it with 9.9 ?
Also, is there a way to combine any of these steps?
remove first line:
tail -n +2 test.csv > test1.txt
remove commas:
sed 's/,/ /g' test1.txt > test2.txt
remove quotes:
sed 's/"//g' test2.txt > test3.txt
remove columns 1 and 8 and
reorder remaining columns as 1,2,6,5,4,3:
sort test3.txt | uniq -c | awk '{print $3 "\t" $4 "\t" $8 "\t" $7 "\t" $6 "\t" $5}' test4.txt
test new columns 1,2,4,5,6 - if the value is over 9.9, replace it with 9.9
How should I do this step?
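One way to do that step, sketched on a single sample row of the reordered, space-separated file (columns 1, 2, 4, 5, 6 assumed numeric; column 3 is the zero-filled one):

```shell
# Cap every numeric column at 9.9; $i+0 forces a numeric comparison even
# though fields are read as strings
out=$(printf '4.29 3.3 0075 11 5.6 4.69\n' |
      awk '{ for (i = 1; i <= 6; i++)
               if (i != 3 && $i + 0 > 9.9)
                 $i = 9.9
             print }')
echo "$out"
```

Here the 11 becomes 9.9 and every other field is left alone.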
solution for following parts were found in a previous question - reformating a text file
columns 1,2,4,5,6 round decimals to tenths
column 3 needs to be four characters long, using zero to left fill
remove periods and spaces
awk '{$0=sprintf("%.1f%.1f%4s%.1f%.1f%.1f", $1,$2,$3,$4,$5,$6);gsub(/ /,"0");gsub(/\./,"")}1' test5.txt > test6.txt
This produces the output you want from the original file. Note that in the question you specified "column 4 round to whole number", but in the desired output you had rounded it to one decimal place instead:
awk -F'[,"]+' 'function m(x) { return x < 9.9 ? x : 9.9 }
NR > 1 {
s = sprintf("%.1f%.1f%04d%.1f%.1f%.1f", m($2),m($3),$7,m($6),m($5),m($4))
gsub(/\./, "", s)
print s
}' test.csv
I have specified the field separator as any number of commas and double quotes together, so this "parses" your CSV format for you without requiring any additional steps.
The function m returns the minimum of 9.9 and the number you pass to it.
Output:
43330075995647
59360178995344
40380035995748
64360275964436
The first three in one go:
awk -F, '{gsub(/"/,"");$1=$1} NR>1' test.csv
1 4.29 3.3 4.69 5.6 11 75 35361305a
2 5.87 3.58 4.41 5.32 10.9 178 35361305a
3 4.01 3.75 4.75 5.66 12.2 35 35361305a
4 6.43 3.61 3.56 4.41 9.6 275 35361305a
tail -n +2 file | sort -u | awk -F , '
{
$0 = $1 FS $2 FS $6 FS $5 FS $4 FS $3
for (i = 1; i <= 6; ++i)
if ($i > 9.9)
$i = 9.9
$0 = sprintf("%.1f%.1f%4s%.0f%.1f%.1f", $1, $2, $3, $4, $5, $6)
gsub(/ /, "0"); gsub(/[.]/, "")
print
}
'
Or
< file awk -F , '
NR > 1 {
$0 = $1 FS $2 FS $6 FS $5 FS $4 FS $3
for (i = 1; i <= 6; ++i)
if ($i > 9.9)
$i = 9.9
$0 = sprintf("%.1f%.1f%4s%.0f%.1f%.1f", $1, $2, $3, $4, $5, $6)
gsub(/ /, "0"); gsub(/[.]/, "")
print
}
'
Output:
104309964733
205909954436
304009964838
406409643636