Script to copy all text from one file into template file - bash

I'm extremely new to scripting (this will be the first I've ever written on my own), and I'm struggling...
Essentially, a program I'm using outputs multiple lists of atomic cartesian coordinates for a molecule, where each set of coordinates contains a slightly different geometry. I then want to run calculations on this molecule at each of these geometries in another program. To do this, I need to create input files containing these coordinates based on a template file, and am hoping to do this via a bash or zsh script.
The first program outputs all geometries in a single file, of the form:
13
-15.02035015
C 3.0629012683 -0.1237662359 -0.0004161296
C 1.5725410176 -0.4599705612 -0.0010537192
H 3.6545324244 -1.0351015878 -0.0040975574
H 3.3232895577 0.4531937573 0.8855087768
H 3.3225341254 0.4598595336 -0.8822056347
N 0.6972014643 0.7054585380 0.0017284824
H 1.3274001069 -1.0545725774 0.8830977697
H 1.3271390154 -1.0504225891 -0.8878762403
H 0.8745667924 1.2800026498 -0.8166554074
H 0.8753847581 1.2767560879 0.8221982135
S -2.4024384670 -0.0657095889 -0.0009217321
H -2.1207044390 -1.3609141502 0.0227283569
H -1.0945221361 0.2739471520 0.0001162389
13
-15.02029090
C 3.0458878237 -0.1642767706 -0.0538270794
C 1.5490175255 -0.4572401536 0.0316764611
H 3.3628431459 0.4546044246 0.7845264240
H 3.2796163460 0.3602842378 -0.9790015411
H 3.6124852940 -1.0910645341 -0.0311065021
N 0.7057821467 0.7323073404 0.0100678359
H 1.3291157247 -0.9968212951 0.9565729700
H 1.2449884019 -1.0864558318 -0.8086148493
H 0.8643815373 1.2571851525 -0.8447936589
H 0.9361625337 1.3407384060 0.7900308086
S -2.3784808925 -0.1009812166 -0.0319557326
H -2.4637876581 -0.0476175701 1.2900767837
H -1.0744168237 0.2509451631 -0.0171658709
etc...
Essentially: one line giving the number of atoms (always the same number within a file, but it depends on the molecule you are interested in [3 if I'm looking at water, H2O; 5 for methane, CH4; 4 for ammonia, NH3; etc.]), one comment line (in the case of this program, the energies are written there), and then the cartesian coordinates, followed directly by the next set of coordinates. In my test file, there are 49 sets of coordinates.
The template file will look something like this:
#Comment line, molecule number CONF_NUMBER
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
COORDINATES
*
So, ideally I would end up writing a script which takes the coordinates of each molecule from the coordinates file and generates an input file for every listed geometry based on the template, replacing the COORDINATES text in the template with one of the geometries (and, if possible, including a number in the first comment line, replacing CONF_NUMBER with a number matching the directory and file name):
~/c1/molecule-c1-name.inp:
#Comment line, molecule number 1
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
O 1.0 23.21 1.1
H 2.0 2.90 1.1
H 3.0 2.33 1.1
*
~/c2/molecule-c2-name.inp:
#Comment line, molecule number 2
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
O 2.0 23.21 1.1
H 3.0 2.70 1.43
H 2.0 2.33 1.1
*
etc...
So far, I've been able to break the file up, separating each geometry into an individual coordinate file, and even remove the two lines above each geometry (which are not needed). Unfortunately, I cannot find a way to copy the whole geometry into the template; I'm stuck in a position where only a single line from the coordinate file is copied over per input file. The code I've got so far is:
#!/bin/zsh
input_file=$1
#arguments
while getopts t:j:p:s: flag
do
case "${flag}" in
t) template_file=${OPTARG};;
j) job_file=${OPTARG};;
p) prefix=${OPTARG};;
s) suffix=${OPTARG};;
esac
done
#Determine number of atoms
n_atoms=$(sed -n 1p $1)
#Determine number of lines to separate
splitlines=$(echo $n_atoms | awk '{print ($0+2)}')
#determine number of conformers
n_conformers=$(grep -c "$n_atoms" $1)
#echo $splitlines
#Split the coordinates into individual .xyz files
split -dl $splitlines $1 coords
#Rename coordinate files
for file in coords*
do
sed -i -e 1,2d $file
mv "$file" "$file.xyz"
done
rm *-e
#Copy coordinates into template file
n=1
while read file
do
sed -i "" "s/COORDINATES/"$file"/r" template.inp > ea.h2s-input${n}.inp
((n++))
done < coords00.xyz #first coordinate file produced
In this example, coords00.xyz is the first of the separated coordinate files the script generates, template.inp is the template file, and ea.h2s-input${n}.inp is the name of the resulting input file (which I will later make customisable with arguments, hopefully).
Bear in mind that, during testing, I've been trying to get the simple things working, so this script is only written to get the first geometry copied into the template file (hence why the files are named explicitly, rather than as variables - although I'm hoping to use the arguments at the beginning of the script to help name each resulting input file), but I can't even get that to work!
Unfortunately, all other forum posts I've found only talk about copying small bits of text (a name, a word, one line of text) into templates - never multiple lines, let alone the entire contents of a file.
I have tried everything I can think of to get this to work, and this script is as close as I have gotten, but I cannot figure out how to print all of the coordinate lines into the template. Any help would be greatly appreciated!

If you are open to using awk:
awk -v tmpl="$(<src.tmpl)" 'BEGIN{cnt=1} \
NR==1 {n_atoms=$1} \
NF==1 {flag=1; close(out); out=sprintf("coords""%02d"".xyz", cnt)} \
{if(flag==0) {print $0 > out; coord_cnt++}} \
{if(coord_cnt==n_atoms){printf "*\n" > out; coord_cnt=0}}\
{if (NF==1 && $1!=n_atoms) { flag=0; \
printf "#Comment line, molecule number %s\n%s\n", cnt, tmpl > out; cnt+=1}}' src.txt
The template file contents should look like this (newlines escaped):
# \
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym \
%method \
functional mgga_xc_b97m_v \
end \
etc etc etc... \
\
* xyz 0 1
src.txt contents:
13
-15.02035015
C 3.0629012683 -0.1237662359 -0.0004161296
C 1.5725410176 -0.4599705612 -0.0010537192
H 3.6545324244 -1.0351015878 -0.0040975574
H 3.3232895577 0.4531937573 0.8855087768
H 3.3225341254 0.4598595336 -0.8822056347
N 0.6972014643 0.7054585380 0.0017284824
H 1.3274001069 -1.0545725774 0.8830977697
H 1.3271390154 -1.0504225891 -0.8878762403
H 0.8745667924 1.2800026498 -0.8166554074
H 0.8753847581 1.2767560879 0.8221982135
S -2.4024384670 -0.0657095889 -0.0009217321
H -2.1207044390 -1.3609141502 0.0227283569
H -1.0945221361 0.2739471520 0.0001162389
13
-15.02029090
C 3.0458878237 -0.1642767706 -0.0538270794
C 1.5490175255 -0.4572401536 0.0316764611
H 3.3628431459 0.4546044246 0.7845264240
H 3.2796163460 0.3602842378 -0.9790015411
H 3.6124852940 -1.0910645341 -0.0311065021
N 0.7057821467 0.7323073404 0.0100678359
H 1.3291157247 -0.9968212951 0.9565729700
H 1.2449884019 -1.0864558318 -0.8086148493
H 0.8643815373 1.2571851525 -0.8447936589
H 0.9361625337 1.3407384060 0.7900308086
S -2.3784808925 -0.1009812166 -0.0319557326
H -2.4637876581 -0.0476175701 1.2900767837
H -1.0744168237 0.2509451631 -0.0171658709
Output (note that the output file numbering starts with '01'):
$ ls coords0*
coords01.xyz coords02.xyz
$ cat coords0*
#Comment line, molecule number 1
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
C 3.0629012683 -0.1237662359 -0.0004161296
C 1.5725410176 -0.4599705612 -0.0010537192
H 3.6545324244 -1.0351015878 -0.0040975574
H 3.3232895577 0.4531937573 0.8855087768
H 3.3225341254 0.4598595336 -0.8822056347
N 0.6972014643 0.7054585380 0.0017284824
H 1.3274001069 -1.0545725774 0.8830977697
H 1.3271390154 -1.0504225891 -0.8878762403
H 0.8745667924 1.2800026498 -0.8166554074
H 0.8753847581 1.2767560879 0.8221982135
S -2.4024384670 -0.0657095889 -0.0009217321
H -2.1207044390 -1.3609141502 0.0227283569
H -1.0945221361 0.2739471520 0.0001162389
*
#Comment line, molecule number 2
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
C 3.0458878237 -0.1642767706 -0.0538270794
C 1.5490175255 -0.4572401536 0.0316764611
H 3.3628431459 0.4546044246 0.7845264240
H 3.2796163460 0.3602842378 -0.9790015411
H 3.6124852940 -1.0910645341 -0.0311065021
N 0.7057821467 0.7323073404 0.0100678359
H 1.3291157247 -0.9968212951 0.9565729700
H 1.2449884019 -1.0864558318 -0.8086148493
H 0.8643815373 1.2571851525 -0.8447936589
H 0.9361625337 1.3407384060 0.7900308086
S -2.3784808925 -0.1009812166 -0.0319557326
H -2.4637876581 -0.0476175701 1.2900767837
H -1.0744168237 0.2509451631 -0.0171658709
*
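If you would rather keep the split/rename approach from your own script, sed's r command can splice the entire contents of a file into the template: it queues the named file for output after the matching line, and a second expression then deletes the placeholder line itself. A minimal sketch (GNU sed behaviour, worth re-checking on macOS's BSD sed; the coordsNN.xyz names here are the stripped files your own loop produces, not the finished inputs the awk above writes, and the output name is just the hard-coded one from the question):
n=1
for xyz in coords*.xyz; do
    sed -e "/COORDINATES/r $xyz" -e "/COORDINATES/d" template.inp \
        | sed "s/CONF_NUMBER/$n/" > "ea.h2s-input${n}.inp"
    ((n++))
done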


Using sed to add line above a set of lines

EDIT BELOW
I'm new to bash scripting, sorry if this has been answered elsewhere, couldn't find it in any searches I've done.
I'm using sed -i to add a line above an argument, for example.
for EFP in *.inp; do
sed -i "/^O */i FRAGNAME=H2ODFT" $EFP
done
and it works as expected, but I would like it to add the line only when the pattern matches across multiple lines, like so:
O
C
O
C
FRAGNAME=H2ODFT
O
H
H
FRAGNAME=H2ODFT
O
H
H
Notice there's no added line above the two O's that are followed by C's.
I tried the following:
for FILE in *.inp; do
sed -i "/^O*\nH*\nH */i FRAGNAME=H2ODFT" $EFP
done
and I was expecting it to show up above the 3 lines that went O - H - H, but nothing happened; it passed through the file as if the pattern were nowhere in the document.
I've looked elsewhere and thought of using awk, but I can't wrap my head around it.
Any help would be greatly appreciated!
L
EDIT
Thanks for the help. And sorry for being a bit unclear. I've tried a ton of things, too many to put in this post. I've tried awk, perl and sed solutions, but they're not working.
My input has a series of O's C's and H's, with cartesian coordinates assigned to them:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
I'm trying to insert a new line above a specific set of three lines, the OHH lines.
The awk solution posted didn't work, because it would add extra lines where there shouldn't be when the stage gets reset. I'm looking for the following output:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
The ^t in the sed command was a typo and should've been an indent instead of ^t.
Here is a ruby to do that:
ruby -e 'lines=$<.read.split(/\R/)
lines.each_with_index{|line,i|
three_line_tag=lines[i..i+2].map{|sl| sl.split[0] }.join
puts "FRAGNAME=H2ODFT" if three_line_tag == "OHH"
puts line
}
' file
Or any awk, same kind of method:
awk '{lines[NR]=$0}
END{
for(i=1;i<=NR;i++) {
tag=""
for(j=0;j<=2;j++) {
split(lines[i+j],arr)
tag=tag arr[1]
}
if (tag=="OHH")
print "FRAGNAME=H2ODFT"
print lines[i]
}
}
' file
Or Perl:
perl -0777 -pe 's/(^\h*O\h.*\R^\h*H\h.*\R^\h*H\h.*\R?)/FRAGNAME=H2ODFT\n\1/gm' file
Any print:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
===
Edit in place:
Read THIS about awk and that is generally applicable.
Any of these scripts as written write to stdout.
You can redirect the output to a new file:
someutility input_file >new_file
Or some (like perl, ruby, GNU awk, GNU sed) have the ability to do in-place file replacement. If you don't have that option, you cannot do:
someutil 'prints to STDOUT' file >file
since file will be destroyed before fully read.
Instead you would do:
someutil 'prints to STDOUT' file > tmp && mv tmp file
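For example, applying the awk version above to every .inp file in that same style (a sketch; add_frag.awk is a hypothetical file holding that awk program):
for f in *.inp; do
    awk -f add_frag.awk "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done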
This might work for you (GNU sed):
sed -Ei -e ':a;N;s/\n/&/2;Ta;/^O(\n.)\1$/i FRAGNAME=H2ODFT' -e 'P;D' file1 file2
Open a 3-line window throughout the file and, if the required pattern matches, insert a line with the desired text.
N.B. The \1 back reference matches the line before. Also, the script is in two separate pieces because the i command must be terminated by a newline, which the -e option provides.
An alternative version of the same solution:
cat <<\! | sed -Ef - -i file{1..100}
:a
N
s/\n/&/2
Ta
/^O(\n.)\1$/i FRAGNAME=H2ODFT
P
D
!
If the input files aren't large enough to cause memory issues, you can slurp the entire file and then perform the substitution. For example:
perl -0777 -pe 's/^O\nH\nH\n/FRAGNAME=H2ODFT\n$&/gm' ip.txt
If this works for you, then you can add the -i option for in-place editing. The regex ^O*\nH*\nH * shown in the question isn't clear. ^O\nH\nH\n will match three lines having O, H and H exactly. Adjust as needed.
I know you requested a sed solution, but I have a solution based on awk.
We initialize the awk program with a stage which, over time, will track the progress of "OHH".
As we receive each letter, we grow the stage until we get OHH; then we print your required string and reset the stage.
If we encounter a breakage, we print out whatever we accumulated in stage and reset the stage.
awk '
BEGIN { stage="" }
/^O$/ { if (stage=="") { stage="O\n"; next } }
/^H$/ { if (stage=="O\n") { stage="O\nH\n"; next } }
/^H$/ { if (stage=="O\nH\n") { print "FRAGNAME=H20DFT\nO\nH\nH"; stage=""; next } }
{ print stage $1; stage="" }
' < sample.txt
Where sample.txt contains:
O
C
O
C
O
H
H
O
H
H
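For reference, running the command above on this simplified sample.txt should print the following, matching the desired layout from the first example in the question:
O
C
O
C
FRAGNAME=H2ODFT
O
H
H
FRAGNAME=H2ODFT
O
H
H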

Parsing multiple instances of data

I am trying to parse multiple instances of data from a text file. I can grep and grab one line and the lat/lon associated with that find, but I am having issues parsing multiple instances:
... CATEGORICAL ...
SLGT 33618675 34608681 35658642 36668567 38218542 41018363
41588227 41918045 41377903 40177805 38927813 37817869
36678030 35068154 33368262 33078321 32888462 33618675
SLGT 30440169 31710202 33010185 33730148 34010037 33999962
33709892 32869871 30979883 29539912 29430025 30440169
SLGT 41788755 41698893 42069059 42639132 43889124 44438960
44438757 43988717 43278708 42398720 41788755
MRGL 42897922 41907743 40147624 38837627 37637700 35897915
35028021 34038079 33118130 31998226 31698419 32078601
32818733 33848809 34758764 36998623 38588677 39458701
40178757 40608870 41069099 43549479 44499512 44809478
45259379 44989263 45109100 45718986 46478920 46758853
46738752 46398664 44768565 44308457 43198218
MRGL 29720174 31900221 33650181 34160154 34430032 34649931
34159800 32539784 31359767 29739808 29299723 28969581
28959440 99999999 26769674 26579796 26139874
TSTM 45077438 43177245 40597113 99999999 30488085 30248563
29588926 28739072 28569092 99999999 27138160 27578139
27908100 27848061 27518032 26968006 26338005 25698017
25338025 25088048 25058071 25238109 25578128 25888157
26218171 26578170 26988163 27138160 99999999 29200399
31910374 33520340 35190229 35450147 36109944 36399709
35779395 36399167 38559059 40189373 41729594 43029985
42820283 42860489 43580863 44121062 44521135 45281179
46271166 47561286 48251548 48671765 49051814 99999999
38810245 37660271 37120322 36950398 37090559 37380662
38090741 39410791 39980777 40930695 41380598 41370510
41190353 40840299 40220263 38810245
From: https://www.spc.noaa.gov/products/outlook/archive/2019/KWNSPTSDY1_201906241300.txt
Here is my code and results:
#!/bin/sh
sed -n '/^MRGL/,/^TSTM/p;/^TSTM/q' day1_status | sed '$ d' | sed -e 's/MRGL//g' > MRGL
while read line
do
count=1
ncols=$(echo $line | wc -w)
while [ $count -le $ncols ]
do
echo $line | cut -d' ' -f$count
((count++))
done
done < MRGL > MRGL_output.txt
cat MRGL_output.txt | sed ':a;s/\B[0-9]\{2\}\>/.&/;ta'| sed 's/./, -/6' > MRGL_final
Results:
one instance of MRGL and the lat/lon associated with that polygon
more MRGL
32947889 34137855 35307825 36147735 36327622 35797468
27107968 25518232 99999999 27088303 28418215 30208125
30618064
Turn the lines above into a single value per line
more MRGL_output.txt
32947889
34137855
35307825
36147735
36327622
35797468
27107968
25518232
99999999
27088303
28418215
30208125
30618064
Final format that I need it in
more MRGL_final
32.94, -78.89
34.13, -78.55
35.30, -78.25
36.14, -77.35
36.32, -76.22
35.79, -74.68
27.10, -79.68
25.51, -82.32
99.99, -99.99
27.08, -83.03
28.41, -82.15
30.20, -81.25
30.61, -80.64
Just need to parse multiple instances that show up.
UPDATE for better explanation.
... CATEGORICAL ...
ENH 38298326 40108202 40518094 40357974 39907953 39017948
38038052 36148202 35848297 35888367 36618371 38298326
SLGT 30440169 31710202 33010185 33730148 34010037 33999962
33709892 32869871 30979883 29539912 29430025 30440169
SLGT 33548672 34408661 35918543 36858496 38648520 41018363
41588227 41918045 41377903 40177805 38927813 37817869
36678030 35068154 33368262 33078321 32888462 33548672
SLGT 41788755 41698893 42069059 42639132 43889124 44438960
44438757 43988717 43278708 42398720 41788755
MRGL 29720174 31900221 33650181 34160154 34430032 34649931
34159800 32539784 31359767 30059748 29299723 28969581
28959440 99999999 26769674 26579796 26139874
MRGL 42897922 41907743 40147624 38837627 37637700 35897915
35028021 34038079 33118130 31938225 30758424 30678620
30988709 34128741 36208583 37738554 39508601 40628878
41069099 43549479 44499512 44809478 45259379 44989263
45109100 45718986 46478920 46758853 46738752 46398664
44768565 44308457 43198218
TSTM 30488085 29978211 29408316 29068379 99999999 27138160
27578139 27908100 27848061 27518032 26968006 26338005
25698017 25338025 25088048 25058071 25238109 25578128
25888157 26218171 26578170 26988163 27138160 99999999
45427410 43217292 40247181 99999999 28650405 31910374
33520340 35190229 35450147 36109944 36399709 35779395
36769245 38319148 40189373 41219571 41299753 39959979
38220054 37320091 36560136 36070290 36100295 35840394
36790544 37150626 37880709 39110774 40120876 41150895
41600769 41890540 43070599 43580863 43390914 43401262
44171458 45521497 46131301 47181242 47561286 48251548
48671765 49371856
I want to take the data set above, grab the lat/lon for each available risk (ENH, SLGT, MRGL, TSTM), and place them into this format:
"Enhanced Risk"
38.29, -83.26
40.10, -82.02
40.51, -80.94
40.35, -79.74
39.90, -79.53
39.01, -79.48
38.03, -80.52
36.14, -82.02
35.84, -82.97
35.88, -83.67
36.61, -83.71
38.29, -83.26
End:
"Slight Risk"
30.44, -101.69
31.71, -102.02
33.01, -101.85
33.73, -101.48
34.01, -100.37
33.99, -99.62
33.70, -98.92
32.86, -98.71
30.97, -98.83
29.53, -99.12
29.43, -100.25
30.44, -101.69
End:
"Slight Risk"
33.54, -86.72
34.40, -86.61
35.91, -85.43
36.85, -84.96
38.64, -85.20
41.01, -83.63
41.58, -82.27
41.91, -80.45
41.37, -79.03
40.17, -78.05
38.92, -78.13
37.81, -78.69
36.67, -80.30
35.06, -81.54
33.36, -82.62
33.07, -83.21
32.88, -84.62
33.54, -86.72
End:
"Slight Risk"
41.78, -87.55
41.69, -88.93
42.06, -90.59
42.63, -91.32
43.88, -91.24
44.43, -89.60
44.43, -87.57
43.98, -87.17
43.27, -87.08
42.39, -87.20
41.78, -87.55
End:
"Marginal Risk"
29.72, -101.74
31.90, -102.21
33.65, -101.81
34.16, -101.54
34.43, -100.32
34.64, -99.31
34.15, -98.00
32.53, -97.84
31.35, -97.67
30.05, -97.48
29.29, -97.23
28.96, -95.81
28.95, -94.40
26.76, -96.74
26.57, -97.96
26.13, -98.74
End:
Here's a little awk program which seems to work, although I'm not certain about some of the details. In particular, I don't know what the minimum value for longitude is; evidently, a value under the minimum has 100 added to it before the longitude is negated. So you'll have to change LON_THRESHOLD to what you consider the correct value.
I've tried to avoid the usual temptation to golf awk programs into a textual minimum, in the hopes that the way this program works is less obscure. But it's entirely possible that some awkisms snuck in anyway. I added a bit of explanation at the end.
BEGIN { risk["HIGH"] = "High Risk"
risk["ENH"] = "Enhanced Risk"
risk["SLGT"] = "Slight Risk"
risk["MRGL"] = "Marginal Risk"
LON_THRESHOLD = 30
END_STRING = "End:"
}
END { if (in_risk) print END_STRING }
in_risk && substr($0, 1, 1) != " " {
print END_STRING "\n" "\n"
in_risk = 0
}
$1 in risk { printf("\"%s\"\n", risk[$1])
in_risk = 2
}
in_risk { for (i = in_risk; i <= NF; ++i) {
lat = substr($i, 1, 4) / 100
lon = substr($i, 5, 4) / 100
if (lon < LON_THRESHOLD) lon += 100
printf "%5.2f, %.2f\n", lat, -lon
}
in_risk = 1
}
Save that program as, for example, noaa.awk, and then apply it with:
awk -f noaa.awk input.txt
By way of explanation:
Awk programs consist of a series of rules. Each rule has a predicate -- that is, an expression which evaluates to a true or false value -- and an action.
Awk processes each line from its input in turn, running through all of the rules and executing the actions of the ones whose predicates evaluate to a true value. Inside the action, you can use the $ operator to access individual fields in the input (by default, fields are separated with whitespace). $0 is the entire input line, and $n is field number n. Unlike bash/sh, $ is an operator and can be applied to an expression.
BEGIN and END rules are special, in that their predicates are not matched against input lines. BEGIN rules are executed exactly once, before any other processing; END rules are executed exactly once after all processing is finished. In this example, as is common, BEGIN is used to initialise reference data, while END is used for any necessary termination -- in this case, printing the final End: line.
In cases like this, where the desired action is really dependent on where we are in the file, it's necessary to build some kind of state machine, and I did that using the variable in_risk, which has three possible values:
0 or undefined: We're not currently in a block corresponding to a risk selector.
1: The current line, if it starts with a space, is part of a previously identified risk selector.
2: The current line has been detected as starting with a risk selector.
The reason for the difference between the last two values is that $1 in a line which starts with a risk selector is the risk selector, whereas in a line which starts with a space, $1 is actually the first number. So when we're iterating over the numbers in a line, we have to start with $2 for lines which start with a risk selector.
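As a quick worked example of that decoding (a sketch of the same substr/threshold arithmetic, applied to the first SLGT value from the question with the assumed threshold of 30):
echo 30440169 | awk '{ lat = substr($1, 1, 4) / 100      # "3044" -> 30.44
                       lon = substr($1, 5, 4) / 100      # "0169" ->  1.69
                       if (lon < 30) lon += 100          # below the threshold, so 101.69
                       printf "%5.2f, %.2f\n", lat, -lon }'
# prints: 30.44, -101.69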
If you're just asking how to turn a file of lines like AABBCCDD into lines like AA.BB, -CC.DD:
perl -nE '/^(..)(..)(..)(..)$/ && say "$1.$2, -$3.$4"' MRGL_output.txt
(There are almost certainly better ways to get from your original input to those lines, but I'm not really clear on what your posted code is doing or why.)
I think this will process your original input correctly, but can't be sure because the numbers in your sample output don't match up with your sample input so I can't verify:
perl -anE 'if (/^MRGL/ .. /^TSTM/) { exit if /^TSTM/; push @nums, @F }
END { for (@nums) {
if (/^(..)(..)(..)(..)$/) { say "$1.$2, -$3.$4" }
}}' day1_status
Got GNU Awk?
awk -v RS='\\s+' '
/[A-Z]/ {p = /^MRGL$/? 1: 0; next}
p {print gensub(/(..)(..)(..)(..)/, "\\1.\\2, -\\3.\\4", "G")}
' file
-v RS='\\s+' - Use any amount of whitespace as the Record Separator
/[A-Z]/ {...} - On records with uppercase alphabetics, do
p = /^MRGL$/? 1: 0; next - Set flag if record is MRGL, else unset, but always skip any other rules.
p {print gensub(...)} - Print result of gensub if flag is set
/(...)/, "\\1", "G" - Capturing groups, Backreferences, Global substitution.

Extracting plain text output from binary file

I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later call into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high level languages but not C++ or dealing with binary files. I have found a few things through searching stackoverflow but no luck yet in reading this file. Ideally this would be done through bash or python.
thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output out as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the array is a simple byte array. Here is an example of how to read it in Python (note this is Python 2 code):
import struct
from array import array as binarray
import sys

inputfile = sys.argv[1]
data = open(inputfile, 'rb').read()    # read the raw bytes
a = binarray('c')
a.fromstring(data)
s = struct.Struct("f")                 # one 4-byte float (native byte order) per value
l = len(a)
print "%d bytes" % l
n = l / 4                              # number of floats in the file
for i in xrange(0, n):
    x = s.unpack_from(a, i * 4)[0]     # decode the i-th float
    print ("%d %f" % (i, x))
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac/Linux, the following command works to print the 4B.vout data one line per node, with the integer values the same as given in the summary file. If your file is called, e.g., filename.4B.vout, then some command-line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,="\n"; print unpack("L*",<>),"";'
Edited to add: this is for the assignments of connected component ID and community ID, written implicitly: the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1, etc. But I am copy-pasting here, so I'm not sure how it would need to change for floats. It works great for the integer values per node.
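Since the file is apparently just a flat array of 4-byte floats, GNU od can also dump it as text with no scripting at all (a sketch; -A d prints decimal byte offsets, -f decodes single-precision floats, -v keeps repeated rows instead of abbreviating them):
od -A d -f -v graph-name.4B.vout | head -5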

sorting a file of list created by python with write

I have a file created by python3 using:
of.write("{:<6f} {:<10f} {:<18f} {:<10f}\n"
.format((betah), test, (torque*13605.698066), (mom)))
The output file looks like:
$cat pout
15.0 47.13 0.0594315908872 0.933333333334
25.0 29.07 0.143582198404 0.96
20.0 35.95 0.220373446813 0.95
5.0 124.12 0.230837577743 0.800090803982
4.0 146.71 0.239706979471 0.750671150402
0.5 263.24 0.239785533064 0.163953413739
1.0 250.20 0.240498520899 0.313035285499
Now, I want to sort the list.
The expected output of sorting will be:
25.0 29.07 0.143582198404 0.96
20.0 35.95 0.220373446813 0.95
15.0 47.13 0.0594315908872 0.933333333334
5.0 124.12 0.230837577743 0.800090803982
4.0 146.71 0.239706979471 0.750671150402
1.0 250.20 0.240498520899 0.313035285499
0.5 263.24 0.239785533064 0.163953413739
I tried this and the tuples example in this, but they yield the output as:
['0.500000 263.240000 0.239786 0.163953 \n', '15.000000 47.130000 0.059432 0.933333 \n', '1.000000 250.200000 0.240499 0.313035 \n', '25.000000 29.070000 0.143582 0.960000 \n', '20.000000 35.950000 0.220373 0.950000 \n', '4.000000 146.710000 0.239707 0.750671 \n', '5.000000 124.120000 0.230838 0.800091 \n']
Please, don't try to match the numbers of input and output, because both of them are truncated for brevity.
As an example, my own attempt at the sorting (with help from 1) looks like:
f = open("tmp", "r")
lines = [line for line in f if line.strip()]
print(lines)
f.close()
Kindly help me sort the file properly.
The problem you've found is that the strings are sorted alphabetically instead of numerically. What you need to do is convert each item from a string to a float, sort the list of floats, and then output as a string again.
I've recreated your file here, so you can see that I'm reading directly from a file.
pout = [
"15.0 47.13 0.0594315908872 0.933333333334",
"25.0 29.07 0.143582198404 0.96 ",
"20.0 35.95 0.220373446813 0.95 ",
"5.0 124.12 0.230837577743 0.800090803982",
"4.0 146.71 0.239706979471 0.750671150402",
"0.5 263.24 0.239785533064 0.163953413739",
"1.0 250.20 0.240498520899 0.313035285499"]
with open('test.txt', 'w') as thefile:
    for item in pout:
        thefile.write(str("{}\n".format(item)))

# Read in the file, stripping each line
lines = [line.strip() for line in open('test.txt')]

acc = []
# Loop through the list of lines, splitting the numbers at the whitespace
for strings in lines:
    words = strings.split()
    # Convert each item to a float
    words = [float(word) for word in words]
    acc.append(words)

# Sort the new list, reversing because you want highest numbers first
lines = sorted(acc, reverse=True)

# Save it to the file.
with open('test.txt', 'w') as thefile:
    for item in lines:
        thefile.write("{:<6} {:<10} {:<18} {:<10}\n".format(item[0], item[1], item[2], item[3]))
Also note that I use with open('test.txt', 'w') as thefile: as it automatically handles opening and closing the file, which is much safer.
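For what it's worth, since pout is already a plain text file, the same descending order can also be produced from the shell with GNU sort (a sketch; -k1,1 keys on the first column only, g compares general numeric values including decimals, r reverses the order):
sort -k1,1gr pout        # redirect to a file if you want to keep the result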

Calculate sum of size notated figures?

I want to calculate the total size of all .mobi files from this
link (it's a good link by the way).
In my attempt to make this a learning experience, I have made a 'pipe' (let's call it a) that outputs all the sizes from that page, which look like:
189K
20M
549K
2.2M
1.9M
3.1M
2.5M
513K
260K
1.1M
2.8M
5.1M
3.7M
1.5M
5.6M
1.0M
5.6M
1.5M
4.9M
3.4M
810K
My target is to get the total size (ex: 50.50M, or 50000K) - sum of all these numbers.
My question is how to calculate that target using pipelining (a | some_other_commands). Answers using Python or any other language (preferably one-liners) are welcome. Thanks a lot.
For fun, a solution in shell:
a | sed -e 's/M$/ 1024 * +/' -e 's/K$/ +/' | dc -e '0' -f - -e 'p'
Perl one-liner:
a | perl -ne 's/^([\d.]+)M$/$1*1024/e;$sum+=$_; END{print $sum."K"}'
see it
It assumes that all entries are in either kilobytes or megabytes, as shown in the OP's input.
Sigh, someone says “one-liner” and all my code-golf reflexes fire...
ruby -e 'puts $<.read.split.inject(0){ |m,e| m += e.to_f * { "M" => 1, "K" => 0.001 }[e[-1,1]]}.to_s+"M"'
or, with some shortcuts...
ruby -ne 'p @e=@e.to_f+$_.to_f*{"M"=>1,"K"=>0.001}[$_[-2,1]]'
Update: Heh, ok, hard to read. The OP asked for a "one liner". :-)
#!/usr/bin/env ruby
total = 0
while s = gets # get line
scalefactorMK = s.chomp[-1,1] # get the M or K
scalefactor = { 'M'=>1,'K'=>0.001 }[scalefactorMK] # get numeric scale
total += s.to_f * scalefactor # accumulate total
end
puts "%5.1fM" % [total]
if you have Ruby (1.9+)
require 'net/http'
url="http://hewgill.com/~greg/stackoverflow/ebooks/"
response = Net::HTTP.get_response( URI.parse(url) )
data=response.body
total=0
data.split("\n").each do |x|
if x=~/\.mobi/
size = x.split(/\s+/)[-1]
c = case size[-1]
when 'K' then 1024
when 'M' then 1024 * 1024
when 'G' then 1024 * 1024 * 1024
end
total += size[0..-2].to_f * c   # drop the K/M/G suffix; keep fractional sizes like 2.2M
end
end
puts "Total size: %.2f MB" % ( total/(1024.0 * 1024.0) )
awk (assume files less than 1K don't substantially add to the total):
a | awk '/K/ {sum += $1/1024} /M/ {sum += $1} END {printf("%.2fM\n", sum)}'
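For example, with the size list from the question saved in a hypothetical sizes.txt standing in for the a pipe:
cat sizes.txt | awk '/K/ {sum += $1/1024} /M/ {sum += $1} END {printf("%.2fM\n", sum)}'
# prints 68.17M for the sample list (65.9M from the M entries plus 2321K, about 2.27M)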
