Comparing 2 files using AWK with multiple parameters - bash

I have a problem comparing 2 text files using awk. Here is what I want to do.
File1 contains a name in the first column which has to match the name in the first column of file2. That's easy - so far so good. Then, if the names match, I need to check whether the number in the 2nd column of file1 lies within the numeric range given by columns 2 and 3 of file2 (see example). If that's the case, print both matching lines as one line to a new file. I wrote something in awk and it produces output with correct assignments, but it misses the majority of the matches. Am I missing some kind of loop? Both files are sorted by their first column.
File1:
scaffold10| 300 T C 0.9695 0.0000
scaffold10| 456 T A 1.0000 0.0000
scaffold10| 470 C A 0.9906 0.0000
scaffold10| 600 T C 0.8423 0.0000
scaffold56| 5 A C 0.8423 0.0000
scaffold56| 1000 C T 0.8423 0.0000
scaffold56| 6000 C C 0.7518 0.0000
scaffold7| 2 T T 0.9046 0.0000
scaffold9| 300 T T 0.9034 0.0000
scaffold9| 10900 T G 0.9044 0.0000
File2:
scaffold10| 400 550
scaffold10| 700 800
scaffold56| 3 5000
scaffold7| 55 200
scaffold7| 214 567
scaffold7| 656 800
scaffold9| 234 675
scaffold9| 699 1254
scaffold9| 10887 11000
Output:
scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
My awk try:
awk -F "\t" ' FNR==NR {b[$1]=$0; c[$1]=$1; d[$1]=$2; e[$1]=$3; next} for {if (c[$1]==$1 && d[$1]<=$2 && e[$1]>=$2) {print b[$1]"\t"$0}}' File1 File2 > out.txt
How can I get the output I want using awk? Any suggestions are very welcome...

Use join to do a database-style join of the two files and then use awk to filter out the incorrect matches:
$ join file1 file2 | awk '$2 >= $7 && $2 <= $8'
scaffold10| 456 T A 1.0000 0.0000 400 550
scaffold10| 470 C A 0.9906 0.0000 400 550
scaffold56| 5 A C 0.8423 0.0000 3 5000
scaffold56| 1000 C T 0.8423 0.0000 3 5000
scaffold9| 300 T T 0.9034 0.0000 234 675
scaffold9| 10900 T G 0.9044 0.0000 10887 11000
Or, if you want the output formatted the same way as in the example you gave:
$ join file1 file2 | awk '$2 >= $7 && $2 <= $8 { printf("%-12s %-5s %-3s %-3s %-8s %-8s %-12s %-5s %-5s\n", $1, $2, $3, $4, $5, $6, $1, $7, $8); }'
scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
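One caveat: join expects both inputs to be sorted lexically on the join field (field 1 by default). Your files already are; if a real dataset is ordered some other way, a sketch like this, using bash process substitution, keeps join happy:
$ join <(sort -k1,1 file1) <(sort -k1,1 file2) | awk '$2 >= $7 && $2 <= $8'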

An awk solution that reads file2 into an array and then checks every line of file1 against each stored range on the fly. Note the string comparison on the stored first field (a regex match against the whole stored line only works by accident here, since the literal | acts as regex alternation) and the inclusive <= / >= tests, so positions sitting exactly on a range boundary still match:
awk 'NR==FNR{ i++; x[i]=$0; x_0[i]=$1; x_1[i]=$2; x_2[i]=$3 }
NR!=FNR{ for(j=1;j<=i;j++){
    if( $1==x_0[j] && x_1[j]<=$2 && x_2[j]>=$2 ){
        print $0, x[j]
    }
}
}' file2 file1
# scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
# scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
# scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
# scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
# scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
# scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
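If the real files are large, scanning every stored range for every line of file1 gets slow. A minimal sketch of a per-key variant, assuming the same file layout, that keys the ranges on the scaffold name so the inner loop only visits ranges belonging to the current scaffold:
awk 'NR==FNR{ n[$1]++; lo[$1,n[$1]]=$2; hi[$1,n[$1]]=$3; next }   # file2: collect ranges per scaffold
{ for(j=1; j<=n[$1]; j++)                                         # file1: loop only over this scaffold's ranges
    if($2 >= lo[$1,j] && $2 <= hi[$1,j])
        print $0, $1, lo[$1,j], hi[$1,j] }' file2 file1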

Related

Vlookup in awk: how to list anything occurring in file2 but not in file1 at the end of output?

I have two files, file 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
and file 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 804 804 0.24
I have an awk script that looks in the second file for values that match the first three columns of the first file.
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
Is there a way to make it such that any rows occurring in file 2 that are not in file 1 are still added at the end of the output, after the matches, such as this:
0.55
0.09
0.88
not found
4 804 804 0.24
That way, when I paste the two files back together, they will look something like this:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 not found
4 804 804 not found 0.24
Or is there any other more elegant solution with completely different syntax?
awk '{k=$1FS$2FS$3}NR==FNR{a[k]=$4;next}
k in a{print $4;next}{print "not found";print}' f1 f2
The above one-liner will give you:
0.55
0.09
0.88
not found
4 804 804 0.24
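Spelled out over several lines with comments, the same logic reads (no behavior change):
awk '{ k = $1 FS $2 FS $3 }         # composite key from the first three fields
NR==FNR { a[k] = $4; next }         # first pass (f1): remember column 4 under the key
k in a  { print $4; next }          # key known from f1: print the f2 value
{ print "not found"; print }        # unknown row: flag it, then emit the row itself' f1 f2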

How to compare multiple columns in two files and retrieve the corresponding value from another column if a match is found

I have two files, File1.txt and File2.txt. I need to compare columns 1, 2 and 3 of File1 with columns 4, 5 and 6 of File2, respectively, and if a match is found, retrieve the corresponding value from column 2 of File2 and write it to Output.txt. The sample files are below:
File1.txt
ASP B 276
ASN B 290
ALA B 294
ALA B 297
ARG B 298
ARG B 303
LYS D 288
File2.txt
ATOM 4770 N ALA A 346 71.417 37.005 4.562 1 0 N
ATOM 4778 C ALA A 346 72.003 34.855 3.476 1 0 C
ATOM 4779 O ALA A 346 72.956 34.2 3.103 1 0 O
ATOM 4859 N SER A 353 78.218 33.415 -2.595 1 0 N
ATOM 4867 HG SER A 353 78.828 31.548 0.899 1 0
ATOM 4868 C SER A 353 79.637 31.351 -2.619 1 0 C
ATOM 4869 O SER A 353 80.669 30.76 -2.372 1 0 O
ATOM 9771 N ASP B 238 52.86 30.061 -7.031 1 0 N
ATOM 9772 H ASP B 238 53.651 30.105 -7.641 1 0
ATOM 9776 HB1 ASP B 238 53.516 32.92 -5.486 1 0
ATOM 10352 H ASP B 276 11.565 35.255 6.968 1 0
ATOM 10356 HB1 ASP B 276 10.084 33.659 6.727 1 0
ATOM 10357 HB2 ASP B 276 10.331 32.059 6.945 1 0
ATOM 10358 CG ASP B 276 9.946 33.07 8.681 1 0 C
ATOM 10453 H ASN B 290 16.73 30.519 13.339 1 0
ATOM 10454 CA ASN B 290 18.755 31.013 13.763 1 0 C
ATOM 10458 HB2 ASN B 290 20.105 29.465 13.891 1 0
ATOM 10459 CG ASN B 290 18.471 28.842 14.99 1 0 C
ATOM 10460 OD1 ASN B 290 18.246 29.429 16.072 1 0 O
ATOM 10512 H ALA B 294 24.099 33.167 8.943 1 0
ATOM 10513 CA ALA B 294 26.095 33.794 9.273 1 0 C
ATOM 10514 HA ALA B 294 26.597 34.261 8.545 1 0
ATOM 10515 CB ALA B 294 25.515 34.817 10.199 1 0 C
ATOM 10556 H ALA B 297 28.288 31.299 7.752 1 0
ATOM 10557 CA ALA B 297 30.202 31.869 7.061 1 0 C
ATOM 10558 HA ALA B 297 30.566 31.457 6.226 1 0
ATOM 10566 H ARG B 298 30.012 32.059 9.568 1 0
ATOM 10567 CA ARG B 298 31.961 32.047 10.392 1 0 C
ATOM 10568 HA ARG B 298 32.532 32.853 10.237 1 0
ATOM 10569 CB ARG B 298 31.251 32.167 11.74 1 0 C
ATOM 10650 HE ARG B 303 36.405 23.564 2.394 1 0
ATOM 10651 CZ ARG B 303 34.807 22.582 3.07 1 0 C
ATOM 10652 NH1 ARG B 303 33.867 22.493 3.991 1 0 N
ATOM 10653 1HH1 ARG B 303 33.829 23.162 4.733 1 0
ATOM 10654 2HH1 ARG B 303 33.192 21.757 3.947 1 0
ATOM 10655 NH2 ARG B 303 34.847 21.706 2.081 1 0 N
ATOM 17143 OE1 GLU C 295 59.322 13.561 -6.631 1 0 O
ATOM 17144 OE2 GLU C 295 57.646 14.02 -7.941 1 0 O
ATOM 17145 C GLU C 295 54.718 13.527 -3.448 1 0 C
ATOM 17146 O GLU C 295 54.509 14.618 -2.982 1 0 O
ATOM 23627 HB1 LYS D 288 32.909 52.854 29.282 1 0
ATOM 23628 HB2 LYS D 288 31.41 53.372 29.672 1 0
ATOM 23629 CG LYS D 288 32.811 53.749 31.138 1 0 C
ATOM 23630 HG1 LYS D 288 32.137 53.82 31.873 1 0
ATOM 23631 HG2 LYS D 288 33.636 53.303 31.484 1 0
ATOM 23632 CD LYS D 288 33.168 55.146 30.656 1 0 C
Output.txt should contain the column-2 values of the second file, but only for rows whose columns 4, 5 and 6 match the three columns of File1.
Output.txt
10352
10356
10357
10358
10453
10454
10458
10459
10460
10512
10513
10514
10515
10556
10557
10558
10566
10567
10568
10569
10650
10651
10652
10653
10654
10655
23627
23628
23629
23630
23631
23632
I have tried the awk one-liner below. It runs, but it retrieves the wrong column-2 values from File2. I need help working out where I am going wrong.
awk 'FNR==NR{a[$1,$2,$3]=$0;next}{if(b=a[$4,$5,$6]){print $2}}' File1.txt File2.txt > Output.txt
You can use this awk command:
awk 'FNR==NR{a[$1,$2,$3]; next} ($4,$5,$6) in a{print $2}' file1 file2
10352
10356
10357
10358
10453
10454
10458
10459
10460
10512
10513
10514
10515
10556
10557
10558
10566
10567
10568
10569
10650
10651
10652
10653
10654
10655
23627
23628
23629
23630
23631
23632
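For reference, ($4,$5,$6) in a is awk's multidimensional membership test: the parenthesized list is joined with SUBSEP into a single string key, so the command above is equivalent to this explicit form:
awk 'FNR==NR{ a[$1 SUBSEP $2 SUBSEP $3]; next }
($4 SUBSEP $5 SUBSEP $6) in a { print $2 }' File1.txt File2.txt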

subtracting data from columns in bash csv

I have several columns in a file and I want to subtract two of them. The values use a dot as a thousands separator (there are no decimals), like this:
1.000 900
1.012 1.010
1.015 1.005
1.020 1.010
I need another column in the same file containing the difference:
100
2
10
10
I have tried
awk -F "," '{$16=$4-$2; print $1","$2","$3","$4","$5","$6}'
but it gives me...
0.100
0.002
0.010
0.010
Any pointers?
awk parses 1.000 as the floating-point number 1, which is why your version printed 0.100 and so on. Stripping the thousands separators first makes the values the integers they represent. Using this awk:
awk -v OFS='\t' '{p=$1;q=$2;sub(/\./, "", p); sub(/\./, "", q); print $0, (p-q)}' file
1.000 900 100
1.012 1.010 2
1.015 1.005 10
1.020 1.010 10
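sub only replaces the first dot it finds, which suffices here because each value carries a single thousands separator. If values could run past 999.999 (two separators), gsub would remove them all; a sketch:
awk -v OFS='\t' '{ p=$1; q=$2; gsub(/\./, "", p); gsub(/\./, "", q); print $0, p-q }' file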
Using perl:
perl -lanE '$,="\t",($x,$y)=map{s/\.//r}@F;say@F,$x-$y' file
prints:
1.000 900 100
1.012 1.010 2
1.015 1.005 10
1.020 1.010 10

x,y points at 95% confidence interval using awk

I'm pretty new to Linux and want to use bash/awk to find the x points where y=.975 and y=.025 (the bounds of a 95% confidence interval), which I can then use to get the 'width' of my broad peak (the data roughly makes a bell-curve shape).
This is the set of data with x,y values like so (NOTE: I intend to make the dx increment much smaller, resulting in many more/finer points):
0 0
0.100893 0
0.201786 0
0.302679 0
0.403571 0
0.504464 0
0.605357 0
0.70625 0
0.807143 0
0.908036 0
1.00893 0
1.10982 0
1.21071 0
1.31161 0
1.4125 0.00173803
1.51339 0.0186217
1.61429 0.0739904
1.71518 0.211295
1.81607 0.725379
1.91696 2.34137
2.01786 4.69752
2.11875 6.58415
2.21964 6.06771
2.32054 8.57593
2.42143 11.7745
2.52232 12.4957
2.62321 13.0301
2.72411 11.1008
2.825 11.4504
2.92589 12.6537
3.02679 12.1584
3.12768 11.0262
3.22857 6.89166
3.32946 5.88521
3.43036 6.48794
3.53125 5.0121
3.63214 2.70189
3.73304 0.914824
3.83393 0.154436
3.93482 0.0286775
4.03571 0.00533823
4.13661 0.00024829
4.2375 0
4.33839 0
4.43929 0
4.54018 0
4.64107 0
4.74196 0
4.84286 0
4.94375 0
5.04464 0
5.14554 0
5.24643 0
5.34732 0
5.44821 0
5.54911 0
First I want to normalise the y data so the values add up to a total of 1 (essentially turning each y into the probability of its x value).
Then I want to determine the x-values that mark the start and end of the 95% confidence interval for the data set. The way I tackled this was to keep a running sum of the column-2 y-values and then compute runsum/sum; this way the values fill up from 0 to 1 (see below). (NOTE: I used column -t to clean up the output a little.)
sum=$( awk 'BEGIN {sum=0} {sum+=$2} END {print sum}' mydata.txt )
awk '{runsum += $2} ; {if (runsum!=0) {print $0,$2/'$sum',runsum/'$sum'} else{print $0,"0","0"}}' mydata.txt | column -t
This gives:
0 0 0 0
0.100893 0 0 0
0.201786 0 0 0
0.302679 0 0 0
0.403571 0 0 0
0.504464 0 0 0
0.605357 0 0 0
0.70625 0 0 0
0.807143 0 0 0
0.908036 0 0 0
1.00893 0 0 0
1.10982 0 0 0
1.21071 0 0 0
1.31161 0.00136559 8.92134e-06 8.92134e-06
1.4125 0.0259463 0.000169506 0.000178427
1.51339 0.159775 0.0010438 0.00122223
1.61429 0.552197 0.00360748 0.00482971
1.71518 1.2808 0.00836741 0.0131971
1.81607 2.20568 0.0144096 0.0276067
1.91696 3.29257 0.0215102 0.049117
2.01786 4.27381 0.0279206 0.0770376
2.11875 7.10469 0.0464146 0.123452
2.21964 9.56549 0.062491 0.185943
2.32054 11.3959 0.0744489 0.260392
2.42143 8.16116 0.0533165 0.313709
2.52232 9.08145 0.0593287 0.373037
2.62321 9.3105 0.0608251 0.433863
2.72411 10.8084 0.0706108 0.504473
2.825 10.4597 0.0683328 0.572806
2.92589 9.81763 0.0641382 0.636944
3.02679 9.06295 0.0592079 0.696152
3.12768 8.84222 0.0577659 0.753918
3.22857 10.285 0.0671915 0.82111
3.32946 8.37618 0.0547212 0.875831
3.43036 7.02052 0.0458648 0.921696
3.53125 4.82589 0.0315273 0.953223
3.63214 3.39214 0.0221607 0.975384
3.73304 2.2402 0.0146351 0.990019
3.83393 1.06194 0.00693761 0.996956
3.93482 0.350213 0.00228793 0.999244
4.03571 0.091619 0.000598543 0.999843
4.13661 0.0217254 0.000141931 0.999985
4.2375 0.00211046 1.37875e-05 0.999999
4.33839 0 0 0.999999
4.43929 0 0 0.999999
4.54018 0 0 0.999999
4.64107 0 0 0.999999
4.74196 0 0 0.999999
4.84286 0 0 0.999999
4.94375 0 0 0.999999
5.04464 0 0 0.999999
5.14554 0 0 0.999999
5.24643 0 0 0.999999
5.34732 0 0 0.999999
5.44821 0 0 0.999999
5.54911 0 0 0.999999
I guess I could use this to find the x points where y=.975 and y=.025 and solve my problem but do you guys know of a more elegant way and is this doing what I think it is?
The 95% confidence interval is displayed at the bottom of the output:
$ awk -v "sum=$sum" -v lower=N -v upper=N '{runsum += $2; cdf=runsum/sum; printf "%10.4f %10.4f %10.4f %10.4f",$1,$2,$2/sum,cdf; print ""} lower=="N" && cdf>0.025{lower=$1} upper=="N" && cdf>0.975 {upper=$1} END{printf "lower=%s upper=%s\n",lower,upper}' mydata.txt
0.0000 0.0000 0.0000 0.0000
0.1009 0.0000 0.0000 0.0000
0.2018 0.0000 0.0000 0.0000
0.3027 0.0000 0.0000 0.0000
0.4036 0.0000 0.0000 0.0000
0.5045 0.0000 0.0000 0.0000
0.6054 0.0000 0.0000 0.0000
0.7063 0.0000 0.0000 0.0000
0.8071 0.0000 0.0000 0.0000
0.9080 0.0000 0.0000 0.0000
1.0089 0.0000 0.0000 0.0000
1.1098 0.0000 0.0000 0.0000
1.2107 0.0000 0.0000 0.0000
1.3116 0.0000 0.0000 0.0000
1.4125 0.0017 0.0000 0.0000
1.5134 0.0186 0.0001 0.0001
1.6143 0.0740 0.0005 0.0006
1.7152 0.2113 0.0014 0.0020
1.8161 0.7254 0.0047 0.0067
1.9170 2.3414 0.0153 0.0220
2.0179 4.6975 0.0307 0.0527
2.1187 6.5842 0.0430 0.0957
2.2196 6.0677 0.0396 0.1354
2.3205 8.5759 0.0560 0.1914
2.4214 11.7745 0.0769 0.2683
2.5223 12.4957 0.0816 0.3500
2.6232 13.0301 0.0851 0.4351
2.7241 11.1008 0.0725 0.5076
2.8250 11.4504 0.0748 0.5824
2.9259 12.6537 0.0827 0.6651
3.0268 12.1584 0.0794 0.7445
3.1277 11.0262 0.0720 0.8165
3.2286 6.8917 0.0450 0.8616
3.3295 5.8852 0.0384 0.9000
3.4304 6.4879 0.0424 0.9424
3.5312 5.0121 0.0327 0.9751
3.6321 2.7019 0.0177 0.9928
3.7330 0.9148 0.0060 0.9988
3.8339 0.1544 0.0010 0.9998
3.9348 0.0287 0.0002 1.0000
4.0357 0.0053 0.0000 1.0000
4.1366 0.0002 0.0000 1.0000
4.2375 0.0000 0.0000 1.0000
4.3384 0.0000 0.0000 1.0000
4.4393 0.0000 0.0000 1.0000
4.5402 0.0000 0.0000 1.0000
4.6411 0.0000 0.0000 1.0000
4.7420 0.0000 0.0000 1.0000
4.8429 0.0000 0.0000 1.0000
4.9437 0.0000 0.0000 1.0000
5.0446 0.0000 0.0000 1.0000
5.1455 0.0000 0.0000 1.0000
5.2464 0.0000 0.0000 1.0000
5.3473 0.0000 0.0000 1.0000
5.4482 0.0000 0.0000 1.0000
5.5491 0.0000 0.0000 1.0000
lower=2.01786 upper=3.53125
To be more accurate, one would want to interpolate between adjacent values to get the 2.5% and 97.5% limits. You mentioned, however, that your actual dataset has many more data points; in that case, interpolation is a superfluous complication.
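(Should the interpolated crossings ever be wanted anyway, here is a linear-interpolation sketch in the same vein: it tracks the previous point and solves for the x at which the cdf crosses each threshold.)
awk -v "sum=$sum" '
{ runsum += $2; cdf = runsum/sum
  if (prevcdf < 0.025 && cdf >= 0.025)    # crossed the lower threshold on this step
      lower = prevx + ($1-prevx)*(0.025-prevcdf)/(cdf-prevcdf)
  if (prevcdf < 0.975 && cdf >= 0.975)    # crossed the upper threshold
      upper = prevx + ($1-prevx)*(0.975-prevcdf)/(cdf-prevcdf)
  prevx = $1; prevcdf = cdf }
END { printf "lower=%g upper=%g\n", lower, upper }' mydata.txt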
How it works:
-v "sum=$sum" -v lower=N -v upper=N
Here we define three variables to be used by awk. Note that we define sum here as an awk variable. That allows us to use sum in the awk formulas without the complication of mixing shell variable expansion in with awk code.
runsum += $2; cdf=runsum/sum;
Just as you had it, we compute the running sum, runsum, and the cumulative probability distribution, cdf.
printf "%10.4f %10.4f %10.4f %10.4f",$1,$2,$2/sum,cdf; print ""
Here we print out each line. I took the liberty here of changing the format to something that prints pretty. If you need tab-separated values, then change this back.
lower=="N" && cdf>0.025{lower=$1}
If we have not previously reached the lower confidence limit, then lower is still equal to N. If that is the case and the current cdf is now greater than 0.025, we set lower to the current value of x.
upper=="N" && cdf>0.975 {upper=$1}
This does the same for the upper confidence limit.
END{printf "lower=%s upper=%s\n",lower,upper}
At the end, this prints the lower and upper confidence limits.

Awk While and For Loop

I have two files (file1 and file2)
file1:
-11.61
-11.27
-10.47
file2:
NAME
NAME
NAME
I want to use awk to search for each occurrence of NAME in file2 and add the corresponding line of file1 before it: the 1st line of file1 before the first NAME, and so on. The desired output is
########## Energy: -11.61
NAME
########## Energy: -11.27
NAME
########## Energy: -10.47
NAME
I tried this code
#!/bin/bash
file=file1
while IFS= read line
do
# echo line is stored in $line
echo $line
awk '/MOLECULE/{print "### Energy: "'$line'}1' file2 > output
done < "$file"
But this was the output that I got
########## Energy: -10.47
NAME
########## Energy: -10.47
NAME
########## Energy: -10.47
NAME
I don't know why the script is putting only the last value of file1 before each occurrence of NAME in file2.
I appreciate your help!
Sorry if I wasn't clear in my question. Here are the samples of my files (energy.txt and sample.mol2):
[user]$cat energy.txt
-11.61
-11.27
-10.47
[user]$cat sample.mol2
#<TRIPOS>MOLECULE
methane
5 4 1 0 0
SMALL
NO_CHARGES
#<TRIPOS>ATOM
1 C 2.8930 -0.4135 -1.3529 C.3 1 <1> 0.0000
2 H1 3.9830 -0.4135 -1.3529 H 1 <1> 0.0000
3 H2 2.5297 0.3131 -0.6262 H 1 <1> 0.0000
4 H3 2.5297 -1.4062 -1.0869 H 1 <1> 0.0000
5 H4 2.5297 -0.1476 -2.3456 H 1 <1> 0.0000
#<TRIPOS>BOND
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
#<TRIPOS>MOLECULE
ammonia
4 3 1 0 0
SMALL
NO_CHARGES
#<TRIPOS>ATOM
1 N 8.6225 -3.5397 -1.3529 N.3 1 <1> 0.0000
2 H1 9.6325 -3.5397 -1.3529 H 1 <1> 0.0000
3 H2 8.2858 -2.8663 -0.6796 H 1 <1> 0.0000
4 H3 8.2858 -4.4595 -1.1065 H 1 <1> 0.0000
#<TRIPOS>BOND
1 1 2 1
2 1 3 1
3 1 4 1
#<TRIPOS>MOLECULE
water
3 2 1 0 0
SMALL
NO_CHARGES
#<TRIPOS>ATOM
1 O 7.1376 3.8455 -3.4206 O.3 1 <1> 0.0000
2 H1 8.0976 3.8455 -3.4206 H 1 <1> 0.0000
3 H2 6.8473 4.4926 -2.7736 H 1 <1> 0.0000
#<TRIPOS>BOND
1 1 2 1
2 1 3 1
This is the output that I need
########## Energy: -11.61
#<TRIPOS>MOLECULE
methane
5 4 1 0 0
SMALL
NO_CHARGES
#<TRIPOS>ATOM
1 C 2.8930 -0.4135 -1.3529 C.3 1 <1> 0.0000
2 H1 3.9830 -0.4135 -1.3529 H 1 <1> 0.0000
3 H2 2.5297 0.3131 -0.6262 H 1 <1> 0.0000
4 H3 2.5297 -1.4062 -1.0869 H 1 <1> 0.0000
5 H4 2.5297 -0.1476 -2.3456 H 1 <1> 0.0000
#<TRIPOS>BOND
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
########## Energy: -11.27
#<TRIPOS>MOLECULE
ammonia
4 3 1 0 0
SMALL
NO_CHARGES
#<TRIPOS>ATOM
1 N 8.6225 -3.5397 -1.3529 N.3 1 <1> 0.0000
2 H1 9.6325 -3.5397 -1.3529 H 1 <1> 0.0000
3 H2 8.2858 -2.8663 -0.6796 H 1 <1> 0.0000
4 H3 8.2858 -4.4595 -1.1065 H 1 <1> 0.0000
#<TRIPOS>BOND
1 1 2 1
2 1 3 1
3 1 4 1
########## Energy: -10.47
#<TRIPOS>MOLECULE
water
3 2 1 0 0
SMALL
NO_CHARGES
#<TRIPOS>ATOM
1 O 7.1376 3.8455 -3.4206 O.3 1 <1> 0.0000
2 H1 8.0976 3.8455 -3.4206 H 1 <1> 0.0000
3 H2 6.8473 4.4926 -2.7736 H 1 <1> 0.0000
#<TRIPOS>BOND
1 1 2 1
2 1 3 1
Your loop keeps only the last value because > output truncates the file on every iteration: each awk run rewrites the whole file, so only the final pass, using the last $line, survives. There is no need for the shell loop at all:
paste -d "\n" <(sed 's/^/########## Energy: /' file1) file2
########## Energy: -11.61
NAME
########## Energy: -11.27
NAME
########## Energy: -10.47
NAME
Or, sticking with awk
awk '{
print "########## Energy: " $0
getline < "file2"
print
}' file1
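One hedge on the getline form: if file2 runs out of lines before file1 does, an unchecked getline leaves the previous record in place and prints it again. Checking the return value avoids that; a sketch:
awk '{ print "########## Energy: " $0
       if ((getline line < "file2") > 0)   # 1 on success, 0 at EOF, -1 on error
           print line }' file1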
Using awk:
awk 'NR==FNR{a[NR]=$0;next}
/#<TRIPOS>MOLECULE/{print "########## Energy: ", a[++i]}1' energy.txt sample.mol2
Explanation:
FNR is the line number within the current input file; NR is the line number across all input files combined.
NR==FNR is therefore only true while reading the first file, energy.txt,
so a[NR]=$0; next populates an array with indexes 1,2,3,... and the full line as the value.
The /#<TRIPOS>MOLECULE/ search is then executed against the second file, sample.mol2.
When the search succeeds, awk prints the quoted static string plus the next entry from the array built from the first file;
++i advances the counter so each match consumes the next energy value.
The trailing 1 is an always-true pattern with the default print action, so every line of sample.mol2 is passed through unchanged.
