Matching numbers in two different files using awk - bash

I have two files (f1 and f2), both made of three columns, of different lengths. I would like to create a new file of four columns in the following way:
f1:
1 2 0.2
1 3 0.5
1 4 0.2
2 2 0.5
2 3 0.9

f2:
1 4 0.3
1 5 0.2
2 3 0.6
If the first two numbers of a row appear in both files, the new file should contain those two numbers followed by the third number from each file (e.g. 1 4 appears in both, so f3 should contain the line 1 4 0.2 0.3); if the first two numbers are missing from f2, a zero should be printed in the fourth column instead.
The complete result for this example should be:
f3:
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
The script that I wrote is the following:
awk '{str1=$1; str2=$2; str3=$3;
getline < "f2";
if($1==str1 && $2==str2)
print str1,str2,str3,$3 > "f3";
else
print str1,str2,str3,0 > "f3";
}' f1
but it only checks whether the same two numbers appear on the corresponding row of f2 (it does not go through the whole f2 file), giving as result:
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0
2 2 0.5 0
2 3 0.9 0

This awk should work:
awk 'FNR==NR{a[$1,$2]=$3;next} {print $0, (a[$1,$2])? a[$1,$2]:0}' f2 f1
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
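For readers unfamiliar with the two-file idiom, here is the same command spread over several lines with comments; it is functionally identical to the one-liner above, with the redirection to f3 added to match the question:
awk '
  # FNR==NR is true only while the first file (f2) is being read:
  # store its third column, keyed by the first two columns.
  FNR==NR { a[$1,$2]=$3; next }

  # Second file (f1): print the whole line plus the stored value for
  # this (col1,col2) pair, or 0 if f2 had no such pair.
  { print $0, (a[$1,$2] ? a[$1,$2] : 0) }
' f2 f1 > f3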

Related

I have a tab-delimited data frame sorted by the first column; how can I add an empty line between unique values of the first column?

I would like my data to look like:
-2 -2 1
-2 -1 0
-2 0 1
-2 1 1.5
-2 2 2
-1 -2 1.5
-1 -1 0.5
-1 0 1.5
-1 1 1
-1 2 1.5
0 -2 1.3
0 -1 0.2
0 0 1.6
0 1 1.2
0 2 2.3
Where there are 3 tab-separated columns in total, all 3 values are doubles, and a blank line separates each group of unique values in the first column.
Currently, I have similar data, but no separation. Any ideas on how I can do this in bash or something similar?
awk 'NR > 1 && $1 != prev {print ""} {print; prev=$1}' file
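The same one-liner with comments, in case the logic is not obvious (identical behaviour, just spread out):
awk '
  # When the first column changes (and this is not the very first line),
  # print an empty line before the current record.
  NR > 1 && $1 != prev { print "" }

  # Print every record and remember its first column for the next comparison.
  { print; prev = $1 }
' file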

Add lines with 0 for missing values in a datatable

I have a dataset counting occurrences of bins, for instance:
1 10
2 15
3 1
5 50
8 990
As you can see, I am missing bins in the first column. As I want to plot this data, I'm looking for a way to add those missing values, with a 0 in the second column, e.g. if I know my bins go up to 10:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
I'm looking for a unix/bash solution as it fits my pipeline and my files are rather big, but maybe R is better suited for this?
EDIT: Thanks to karafaka, adding a solution which also fills in the missing bins before the first line of input.
awk -v value=10 '$1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(++prev<=value){print prev,"0"}}}' Input_file
Let's say the following is the Input_file:
cat Input_file
3 10
4 15
7 1
9 50
19 990
After running the above code, we get the following output:
1 0
2 0
3 10
4 15
5 0
6 0
7 1
8 0
9 50
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 990
Could you please try the following (this version does not fill in bins before the first input line):
awk -v value=10 'prev && $1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(++prev<=value){print prev,"0"}}}' Input_file
Adding a non-one-liner form of the solution too:
awk -v value=10 '
prev && $1-prev>1{
  # Fill the gap between the previous bin and the current one with zeros.
  while(++prev<$1){
    print prev,"0"
  }
}
{
  # Print the current line and remember its bin.
  prev=$1
  print
}
END{
  # Fill any remaining bins up to "value" with zeros.
  if(prev<value){
    while(++prev<=value){
      print prev,"0"
    }
  }
}' Input_file
We can combine seq and awk to make the task easier:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' file <(seq 10)
You can do this as well:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$0}' f <(seq -f '%g 0' 10)
Test with your data:
kent$ cat f
1 10
2 15
3 1
5 50
8 990
kent$ awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' f <(seq 10)
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
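A note on how this works, since process substitution can be puzzling: <(seq 10) presents the output of seq to awk as if it were a file, so awk reads the real data first (NR==FNR) and the generated list 1..10 second. A commented version of the same command, with the upper bin 10 hard-coded as in the answer:
awk '
  # First input (the real data): remember each whole line, keyed by its bin.
  NR==FNR { a[$1]=$0; next }

  # Second input (the output of seq 10): print the stored line for this bin
  # if we saw one, otherwise print the bin number followed by 0.
  { print ($1 in a ? a[$1] : $1 FS 0) }
' f <(seq 10)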
Using Bash and join:
$ join -a 1 --nocheck-order -e 0 -o 1.1,2.2 <(seq 10) file
Output:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
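For reference, the GNU join options used here do the following (same command, just annotated):
# -a 1            : also print lines from file 1 (the seq list) that have no match
# --nocheck-order : do not check that the inputs are lexicographically sorted
# -e 0            : fill fields that are missing because a line is unpaired with 0
# -o 1.1,2.2      : output field 1 of file 1, then field 2 of file 2
join -a 1 --nocheck-order -e 0 -o 1.1,2.2 <(seq 10) file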
Another awk:
$ awk -v mx=10 '{while(++k<$1) print k,0}1;
END {while(k++<mx) print k,0}' file
This will also fill in the leading bins if they are missing.
$ awk '{n[$1]=$2} END{for (i=1;i<=10;i++) print i,n[i]+0}' file
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
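The last approach relies on awk's type coercion: an array element that was never assigned evaluates to 0 in a numeric context, so n[i]+0 prints 0 for every missing bin. The same command with comments, with the hard-coded upper bin replaced by a variable mx (my naming):
awk -v mx=10 '
  # Remember the count for every bin that actually appears.
  { n[$1]=$2 }

  # Afterwards walk bins 1..mx; unassigned entries become 0 via n[i]+0.
  END { for (i=1; i<=mx; i++) print i, n[i]+0 }
' file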

Find header value for first occurrence of "1" in column

I have a matrix example:
1 3 5 8 10 12
50 1 1 1 1 1 1
100 0 0 1 1 1 1
150 0 0 1 1 1 1
200 0 0 0 1 1 1
250 0 0 0 0 1 1
300 0 0 0 0 1 1
350 0 0 0 0 0 1
For each row name (50, 100, 150, 200, etc.) I want to know the "header" value at which the value "1" first occurs. Based on the example, the answer is:
50 1
100 5
150 5
200 8
250 10
300 10
350 12
I am not sure how to play with IFs and WHENs to get my answer from this format. R, Excel, bash, awk, all welcome as solutions.
You can do this using awk as follows:
$ awk 'FNR==1{for(i=1; i<=NF; i++){a[i]=$i}; next} {for(i=2; i<=NF; i++){if($i=="1"){print $1, a[i-1]; break}}} ' file
50 1
100 5
150 5
200 8
250 10
300 10
350 12
Explanation:
For the header, i.e. FNR==1, we populate the array a with all the header values;
For every subsequent line we check which field equals 1; when it is found, we print the col1 value, i.e. $1, together with the corresponding value from array a, and break out of the loop.
Awk solution:
awk 'NR==1{ for(i=1;i<=NF;i++) h[i]=$i; next }
     {
       for(i=2;i<=NF;i++) { if($i==1) { n=h[i-1]; break } }
       print $1, (n)?n:"None"; n=""
     }' file
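The second solution differs from the first mainly in its fallback: if a row contains no 1 at all, it prints None for that row instead of silently skipping it. A commented version of the same logic:
awk '
  # Header row: remember each header value by its column index.
  NR==1 { for (i=1; i<=NF; i++) h[i]=$i; next }
  {
    # Data rows: field 1 is the row name, so data fields start at 2 and
    # field i lines up with header value h[i-1]. Stop at the first 1.
    for (i=2; i<=NF; i++) if ($i==1) { n=h[i-1]; break }
    print $1, (n ? n : "None")
    n=""    # reset for the next row
  }
' file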

Wildcard symbol with grep -F

I have the following file
0 0
0 0.001
0 0.032
0 0.1241
0 0.2241
0 0.42
0.0142 0
0.0234 0
0.01429 0.01282
0.001 0.224
0.098 0.367
0.129 0
0.123 0.01282
0.149 0.16
0.1345 0.216
0.293 0
0.2439 0.01316
0.2549 0.1316
0.2354 0.5
0.3345 0
0.3456 0.0116
0.3462 0.316
0.3632 0.416
0.429 0
0.42439 0.016
0.4234 0.3
0.5 0
0.5 0.33
0.5 0.5
Notice that the two columns are sorted ascending, first by the first column and then by the second one. The minimum value is 0 and the maximum is 0.5.
I would like to count the number of lines that are:
0 0
and store that number in a file called "0_0". In this case, this file should contain "1".
Then, the same for those that are:
0 0.0*
For example,
0 0.032
And call that file "0_0.0" (it should contain "2"), and so on for all combinations, considering only the first decimal digit (0 0.1*, 0 0.2* ... 0.0* 0, 0.0* 0.0* ... 0.5 0.5).
I am using this loop:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i" "$j"" file | wc -l > "$i"_"$j"
done
done
rm 0_0 #this 0_0 output is badly done, the good way is with the next command, which accepts \n
pcregrep -M "0 0\n" file | wc -l > 0_0
The problem is that, for example, the line
0.0142 0
will not be matched by the iteration "0.0 0", since there are digits after the "0.0". Removing the -F option from grep so that every number starting with "0.0" is considered will not work either, since the dot would then be treated as a wildcard symbol, and therefore, for example, in the iteration "0.1 0" the line
0.0142 0
would be counted, because 0.0142 would then read as 0, "anything", 1.
I hope I am making myself clear!
Is there any way to include a wildcard symbol with grep -F, like in:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i"* "$j"*" file | wc -l > "$i"_"$j"
done
done
(Please notice the asterisks after the variables in the grep command).
Thank you!
Don't use shell loops just to manipulate text; that's what the guys who invented shell also invented awk to do. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
It sounds like all you need is:
awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{ for (pair in cnt) {print cnt[pair] > pair; close(pair)} }' file
That will be vastly more efficient than your nested shell loops approach.
Here are the files it will create and the count each one will contain:
$ awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{for (pair in cnt) print pair "\t" cnt[pair]}' file
0.0_0.3 1
0_0.4 1
0.5_0 1
0.2_0.5 1
0.4_0.3 1
0.0_0 2
0.1_0.0 1
0.3_0 1
0.1_0.1 1
0.1_0.2 1
0.3_0.0 1
0_0 1
0.1_0 1
0.5_0.3 1
0.4_0 1
0.3_0.3 1
0.2_0.0 1
0_0.0 2
0.5_0.5 1
0.3_0.4 1
0.2_0.1 1
0.0_0.0 1
0_0.1 1
0_0.2 1
0.4_0.0 1
0.2_0 1
0.0_0.2 1
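In other words, the awk approach sidesteps the wildcard question entirely: instead of pattern matching, it truncates each number to its first three characters (0.0142 becomes 0.0, 0.5 stays 0.5, 0 stays 0), counts identical truncated pairs in an array, and writes each count to a file named after the pair. The same command with comments:
awk '
  # Key = first three characters of each column, joined by "_"
  # (e.g. the line "0.0142 0" is counted under the key "0.0_0").
  { cnt[substr($1,1,3) "_" substr($2,1,3)]++ }

  # Write each count to a file named after its key; close each file so that
  # many distinct pairs cannot exhaust the open file descriptors.
  END { for (pair in cnt) { print cnt[pair] > pair; close(pair) } }
' file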

Sum values of specific column in multiple files, considering ranges defined in another file

I have a file (let's say file B) like this:
File B:
A1 3 5
A1 7 9
A2 2 5
A3 1 3
The first column defines a filename and the other two define a range in that specific file. In the same directory, I have also three more files (File A1, A2 and A3). Here is 10 sample lines from each file:
File A1:
1 0.6
2 0.04
3 0.4
4 0.5
5 0.009
6 0.2
7 0.3
8 0.2
9 0.15
10 0.1
File A2:
1 0.2
2 0.1
3 0.2
4 0.4
5 0.2
6 0.3
7 0.8
8 0.1
9 0.9
10 0.4
File A3:
1 0.1
2 0.2
3 0.5
4 0.3
5 0.7
6 0.3
7 0.3
8 0.2
9 0.8
10 0.1
I need to add a new column to file B which, for each line, gives the sum of the column-two values in the specified file over the specified range. For example, row 1 of file B means: calculate the sum of the column-two values of lines 3 to 5 of file A1.
The desired output is something like this:
File B:
A1 3 5 0.909
A1 7 9 0.65
A2 2 5 0.9
A3 1 3 0.8
All files are in tabular text format. How can I perform this task? I have access to bash (ubuntu 14.04) and R, but I am not an expert bash or R programmer.
Any help would be greatly appreciated.
Thanks in advance
Given the first file fileB and 3 input files A1, A2 and A3, each with two columns, this produces the output that you want:
#!/bin/bash
while read -r file start end; do
sum=$(awk -vs="$start" -ve="$end" 'NR==s,NR==e{sum+=$2}END{print sum}' "$file")
echo "$file $start $end $sum"
done < fileB
This uses awk to sum up the column-two values over the line range specified by the variables s and e. It's not particularly efficient, as it reads each referenced file once per line of fileB, but depending on the size of your inputs that may not be a problem.
Output:
A1 3 5 0.909
A1 7 9 0.65
A2 2 5 0.9
A3 1 3 0.8
To redirect the output to a file, just add > output_file to the end of the loop. To overwrite the original file, you will need to first write to a temporary file, then overwrite the original file (e.g. > tmp && mv tmp fileB).
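For reference, here is the same loop with comments and with the output written to a new file (the name fileB.out is only an example):
#!/bin/bash
# For each line of fileB (file name, start line, end line), sum column 2
# of the named file over that line range and append the sum.
while read -r file start end; do
    # NR==s,NR==e is awk's range pattern: it is true from the line where
    # NR equals s up to and including the line where NR equals e.
    sum=$(awk -v s="$start" -v e="$end" 'NR==s,NR==e{sum+=$2} END{print sum}' "$file")
    echo "$file $start $end $sum"
done < fileB > fileB.out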
