EDIT BELOW
I'm new to bash scripting, sorry if this has been answered elsewhere, couldn't find it in any searches I've done.
I'm using sed -i to add a line above a line that matches a pattern, for example:
for EFP in *.inp; do
sed -i "/^O */i FRAGNAME=H2ODFT" $EFP
done
and it works as expected. But I would like it to add the line only when the pattern matches across multiple lines, like so:
O
C
O
C
FRAGNAME=H2ODFT
O
H
H
FRAGNAME=H2ODFT
O
H
H
Notice there's no added line above the two O's that are followed by C's.
I tried the following:
for FILE in *.inp; do
sed -i "/^O*\nH*\nH */i FRAGNAME=H2ODFT" $EFP
done
I was expecting the new line to show up above each group of three lines that went O - H - H, but nothing happened: sed passed through the file as if the pattern occurred nowhere in the document.
I've looked elsewhere and thought of using awk, but I can't wrap my head around it.
Any help would be greatly appreciated!
L
EDIT
Thanks for the help. And sorry for being a bit unclear. I've tried a ton of things, too many to put in this post. I've tried awk, perl and sed solutions, but they're not working.
My input has a series of O's C's and H's, with cartesian coordinates assigned to them:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
I'm trying to insert a new line above a specific set of three lines, the OHH lines.
The awk solution posted didn't work, because it would add extra lines where there shouldn't be when the stage gets reset. I'm looking for the following output:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
The ^tsed was a typo and should've been an indent instead of ^t
Here is a ruby to do that:
ruby -e 'lines=$<.read.split(/\R/)
lines.each_with_index{|line,i|
three_line_tag=lines[i..i+2].map{|sl| sl.split[0] }.join
puts "FRAGNAME=H2ODFT" if three_line_tag == "OHH"
puts line
}
' file
Or any awk, same kind of method:
awk '{lines[NR]=$0}
END{
for(i=1;i<=NR;i++) {
tag=""
for(j=0;j<=2;j++) {
split(lines[i+j],arr)
tag=tag arr[1]
}
if (tag=="OHH")
print "FRAGNAME=H2ODFT"
print lines[i]
}
}
' file
Or Perl:
perl -0777 -pe 's/(^\h*O\h.*\R^\h*H\h.*\R^\h*H\h.*\R?)/FRAGNAME=H2ODFT\n$1/gm' file
Any of these prints:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
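For comparison, the same look-ahead idea can be sketched in Python. This is only a minimal sketch and not part of the original answers: the function name tag_water_blocks and its tag parameter are made up for illustration, and it assumes the element symbol is the first whitespace-separated field on each line.

```python
def tag_water_blocks(lines, tag="FRAGNAME=H2ODFT"):
    """Return a copy of lines with tag inserted before every O, H, H run."""
    # First whitespace-separated field of each line ("" for blank lines).
    symbols = [line.split()[0] if line.split() else "" for line in lines]
    out = []
    for i, line in enumerate(lines):
        # Slicing past the end just yields a shorter list, so no bounds check.
        if symbols[i:i + 3] == ["O", "H", "H"]:
            out.append(tag)
        out.append(line)
    return out


if __name__ == "__main__":
    sample = [
        "O 37.045 32.745 36.559",
        "H 36.892 34.226 34.415",
        "O 35.234 38.803 30.513",
        "H 34.303 39.079 30.567",
        "C 33.490 35.015 38.608",
        "O 30.226 35.908 36.744",
        "H 30.557 36.408 37.490",
        "H 30.642 36.311 35.982",
    ]
    print("\n".join(tag_water_blocks(sample)))
```

Like the ruby and awk versions, this holds the whole file in memory, which should be fine for coordinate files of this size.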
===
Edit in place:
Read THIS about awk and that is generally applicable.
Any of these scripts as written write to stdout.
You can redirect the output to a new file:
someutility input_file >new_file
Or some (like perl, ruby, GNU awk, GNU sed) have the ability to do in-place file replacement. If you don't have that option, you cannot do:
someutil 'prints to STDOUT' file >file
since file will be destroyed before fully read.
Instead you would do:
someutil 'prints to STDOUT' file > tmp && mv tmp file
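The same write-to-a-temporary-file-then-rename pattern can be sketched in Python. This is a hedged illustration rather than anything from the original answer: rewrite_in_place and its transform argument are hypothetical names, and the sketch assumes the file is small enough to read into memory.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace the file at path with transform(lines), via a temp file."""
    with open(path) as src:
        lines = src.read().splitlines()
    # Create the temp file in the same directory so os.replace stays on
    # one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in transform(lines):
                tmp.write(line + "\n")
        # Swap in the new file only once it has been written completely.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Because the original is read fully before anything is written, it is never truncated mid-read. One caveat: the replaced file ends up with the temp file's permissions, not the original's.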
This might work for you (GNU sed):
sed -Ei -e ':a;N;s/\n/&/2;Ta;/^O(\n.)\1$/i FRAGNAME=H2ODFT' -e 'P;D' file1 file2
Open a 3 line window throughout the file and if the required pattern matches, insert the line of the desired text.
N.B. The \1 back reference matches the line before. Also the script is in two separate pieces because the i command requires to end in a newline which the -e option provides.
An alternative version of the same solution:
cat <<\! | sed -Ef - -i file{1..100}
:a
N
s/\n/&/2
Ta
/^O(\n.)\1$/i FRAGNAME=H2ODFT
P
D
!
If the input files aren't large enough to cause memory issues, you can slurp the entire file and then perform the substitution. For example:
perl -0777 -pe 's/^O\nH\nH\n/FRAGNAME=H2ODFT\n$&/gm' ip.txt
If this works for you, then you can add the -i option for inplace editing. The regex ^O*\nH*\nH * shown in the question isn't clear. ^O\nH\nH\n will match three lines having O, H and H exactly. Adjust as needed.
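As a rough equivalent of the slurp-and-substitute approach, here is a minimal Python sketch. It is an illustration under assumptions, not the answer's own code: the pattern is widened to allow optional leading blanks and trailing coordinates on each line, and the names OHH and insert_fragname are made up here.

```python
import re

# An O line followed by two H lines; each line may carry coordinates.
# [^\n]* is used instead of .* purely to make the line boundary explicit.
OHH = re.compile(
    r"^[ \t]*O\b[^\n]*\n[ \t]*H\b[^\n]*\n[ \t]*H\b[^\n]*$",
    re.MULTILINE,
)


def insert_fragname(text, tag="FRAGNAME=H2ODFT"):
    """Prefix every O/H/H block in text with a tag line."""
    # A function replacement avoids any backslash processing of the tag.
    return OHH.sub(lambda m: tag + "\n" + m.group(0), text)


if __name__ == "__main__":
    import sys
    sys.stdout.write(insert_fragname(sys.stdin.read()))
```

The \b after each symbol keeps a label such as OH2 from being mistaken for O.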
I know you requested a sed solution, but I have a solution based on awk.
We initialize the awk program with a stage variable which, over time, tracks progress toward "OHH".
If we receive the next expected letter, we grow the stage until we get OHH; then we print your required string and reset the stage.
If the sequence breaks, we print whatever we accumulated in the stage and reset it.
awk '
BEGIN { stage="" }
/^O$/ { if (stage=="") { stage="O\n"; next } }
/^H$/ { if (stage=="O\n") { stage="O\nH\n"; next } }
/^H$/ { if (stage=="O\nH\n") { print "FRAGNAME=H2ODFT\nO\nH\nH"; stage=""; next } }
{ print stage $1; stage="" }
' < sample.txt
Where sample.txt contains:
O
C
O
C
O
H
H
O
H
H
I'm extremely new to scripting (this will be the first I've ever written on my own), and I'm struggling...
Essentially, a program I'm using outputs multiple lists of atomic cartesian coordinates for a molecule, where each set of coordinates contains a slightly different geometry. I then want to run calculations on this molecule at each of these geometries in another program. To do this, I need to create input files containing these coordinates based on a template file, and am hoping to do this via a bash or zsh script.
The first program outputs all geometries in a single file, of the form:
13
-15.02035015
C 3.0629012683 -0.1237662359 -0.0004161296
C 1.5725410176 -0.4599705612 -0.0010537192
H 3.6545324244 -1.0351015878 -0.0040975574
H 3.3232895577 0.4531937573 0.8855087768
H 3.3225341254 0.4598595336 -0.8822056347
N 0.6972014643 0.7054585380 0.0017284824
H 1.3274001069 -1.0545725774 0.8830977697
H 1.3271390154 -1.0504225891 -0.8878762403
H 0.8745667924 1.2800026498 -0.8166554074
H 0.8753847581 1.2767560879 0.8221982135
S -2.4024384670 -0.0657095889 -0.0009217321
H -2.1207044390 -1.3609141502 0.0227283569
H -1.0945221361 0.2739471520 0.0001162389
13
-15.02029090
C 3.0458878237 -0.1642767706 -0.0538270794
C 1.5490175255 -0.4572401536 0.0316764611
H 3.3628431459 0.4546044246 0.7845264240
H 3.2796163460 0.3602842378 -0.9790015411
H 3.6124852940 -1.0910645341 -0.0311065021
N 0.7057821467 0.7323073404 0.0100678359
H 1.3291157247 -0.9968212951 0.9565729700
H 1.2449884019 -1.0864558318 -0.8086148493
H 0.8643815373 1.2571851525 -0.8447936589
H 0.9361625337 1.3407384060 0.7900308086
S -2.3784808925 -0.1009812166 -0.0319557326
H -2.4637876581 -0.0476175701 1.2900767837
H -1.0744168237 0.2509451631 -0.0171658709
etc...
Essentially, each block has one line giving the number of atoms (always the same number within a file, but it depends on the molecule you are interested in [3 if I'm looking at water, H2O; 5 for methane, CH4; 4 for ammonia, NH3; etc.]), one comment line (in the case of this program, the energies are written there), and then the cartesian coordinates, followed directly by the next set of coordinates. In my test file, there are 49 sets of coordinates.
The template file will look something like this:
#Comment line, molecule number CONF_NUMBER
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
COORDINATES
*
So, ideally I would end up writing a script which would take the coordinates of each molecule from the coordinates file and generate an input file for every listed geometry based on the template, replacing the COORDINATES text in the template with the one of the geometries (and, if possible, to include a number in the first comment line, replacing CONF_NUMBER with a number matching the directory and file name):
~/c1/molecule-c1-name.inp:
#Comment line, molecule number 1
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
O 1.0 23.21 1.1
H 2.0 2.90 1.1
H 3.0 2.33 1.1
*
~/c2/molecule-c2-name.inp:
#Comment line, molecule number 2
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
O 2.0 23.21 1.1
H 3.0 2.70 1.43
H 2.0 2.33 1.1
*
etc...
So far, I've been able to split the geometries into individual coordinate files and even remove the two lines above each geometry (which are not needed). Unfortunately, I cannot find a way to copy the whole geometry into the template; I'm stuck in a position where only a single line from the coordinate file is copied over per input file. The code I've got so far is:
#!/bin/zsh
input_file=$1
#arguments
while getopts t:j:p:s: flag
do
case "${flag}" in
t) template_file=${OPTARG};;
j) job_file=${OPTARG};;
p) prefix=${OPTARG};;
s) suffix=${OPTARG};;
esac
done
#Determine number of atoms
n_atoms=$(sed -n 1p $1)
#Determine number of lines to separate
splitlines=$(echo $n_atoms | awk '{print ($0+2)}')
#determine number of conformers
n_conformers=$(grep -c "$n_atoms" $1)
#echo $splitlines
#Split the coordinates into individual .xyz files
split -dl $splitlines $1 coords
#Rename coordinate files
for file in coords*
do
sed -i -e 1,2d $file
mv "$file" "$file.xyz"
done
rm *-e
#Copy coordinates into template file
n=1
while read file
do
sed -i "" "s/COORDINATES/"$file"/r" template.inp > ea.h2s-input${n}.inp
((n++))
done < coords00.xyz #first coordinate file produced
In this example, coords00.xyz is the first of the separated coordinate files the script generates, template.inp is the template file, and ea.h2s-input${n}.inp is the name of the resulting input file (which I will later make customisable with arguments, hopefully).
Bear in mind that, during testing, I've been trying to get the simple things working, so this script is only written to get the first geometry copied into the template file (hence why the files are named explicitly, rather than as variables - although I'm hoping to use the arguments at the beginning of the script to help name each resulting input file), but I can't even get that to work!
Unfortunately, all other forum posts I've found only talk about copying small bits of text (a name, a word, one line of text) into templates - never multiple lines, let alone the entire contents of a file.
I have tried everything I can think of to get this to work, and this script is as close as I have gotten, but I cannot figure out how to print all of the coordinate lines into the template. Any help would be greatly appreciated!
If you are open to using awk:
awk -v tmpl="$(<src.tmpl)" 'BEGIN{cnt=1} \
NR==1 {n_atoms=$1} \
NF==1 {flag=1; close(out); out=sprintf("coords""%02d"".xyz", cnt)} \
{if(flag==0) {print $0 > out; coord_cnt++}} \
{if(coord_cnt==n_atoms){printf "*\n" > out; coord_cnt=0}}\
{if (NF==1 && $1!=n_atoms) { flag=0; \
printf "#Comment line, molecule number %s\n%s\n", cnt, tmpl > out; cnt+=1}}' src.txt
template file contents should look like: (newlines escaped)
# \
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym \
%method \
functional mgga_xc_b97m_v \
end \
etc etc etc... \
\
* xyz 0 1
src.txt contents:
13
-15.02035015
C 3.0629012683 -0.1237662359 -0.0004161296
C 1.5725410176 -0.4599705612 -0.0010537192
H 3.6545324244 -1.0351015878 -0.0040975574
H 3.3232895577 0.4531937573 0.8855087768
H 3.3225341254 0.4598595336 -0.8822056347
N 0.6972014643 0.7054585380 0.0017284824
H 1.3274001069 -1.0545725774 0.8830977697
H 1.3271390154 -1.0504225891 -0.8878762403
H 0.8745667924 1.2800026498 -0.8166554074
H 0.8753847581 1.2767560879 0.8221982135
S -2.4024384670 -0.0657095889 -0.0009217321
H -2.1207044390 -1.3609141502 0.0227283569
H -1.0945221361 0.2739471520 0.0001162389
13
-15.02029090
C 3.0458878237 -0.1642767706 -0.0538270794
C 1.5490175255 -0.4572401536 0.0316764611
H 3.3628431459 0.4546044246 0.7845264240
H 3.2796163460 0.3602842378 -0.9790015411
H 3.6124852940 -1.0910645341 -0.0311065021
N 0.7057821467 0.7323073404 0.0100678359
H 1.3291157247 -0.9968212951 0.9565729700
H 1.2449884019 -1.0864558318 -0.8086148493
H 0.8643815373 1.2571851525 -0.8447936589
H 0.9361625337 1.3407384060 0.7900308086
S -2.3784808925 -0.1009812166 -0.0319557326
H -2.4637876581 -0.0476175701 1.2900767837
H -1.0744168237 0.2509451631 -0.0171658709
Output: (note output files numbering starts with '01')
$ ls coords0*
coords01.xyz coords02.xyz
$ cat coords0*
#Comment line, molecule number 1
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
C 3.0629012683 -0.1237662359 -0.0004161296
C 1.5725410176 -0.4599705612 -0.0010537192
H 3.6545324244 -1.0351015878 -0.0040975574
H 3.3232895577 0.4531937573 0.8855087768
H 3.3225341254 0.4598595336 -0.8822056347
N 0.6972014643 0.7054585380 0.0017284824
H 1.3274001069 -1.0545725774 0.8830977697
H 1.3271390154 -1.0504225891 -0.8878762403
H 0.8745667924 1.2800026498 -0.8166554074
H 0.8753847581 1.2767560879 0.8221982135
S -2.4024384670 -0.0657095889 -0.0009217321
H -2.1207044390 -1.3609141502 0.0227283569
H -1.0945221361 0.2739471520 0.0001162389
*
#Comment line, molecule number 2
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...
* xyz 0 1
C 3.0458878237 -0.1642767706 -0.0538270794
C 1.5490175255 -0.4572401536 0.0316764611
H 3.3628431459 0.4546044246 0.7845264240
H 3.2796163460 0.3602842378 -0.9790015411
H 3.6124852940 -1.0910645341 -0.0311065021
N 0.7057821467 0.7323073404 0.0100678359
H 1.3291157247 -0.9968212951 0.9565729700
H 1.2449884019 -1.0864558318 -0.8086148493
H 0.8643815373 1.2571851525 -0.8447936589
H 0.9361625337 1.3407384060 0.7900308086
S -2.3784808925 -0.1009812166 -0.0319557326
H -2.4637876581 -0.0476175701 1.2900767837
H -1.0744168237 0.2509451631 -0.0171658709
*
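The split-and-fill steps can also be sketched in Python, working directly with the COORDINATES and CONF_NUMBER placeholders from the question's template. This is a minimal sketch under assumptions: the function names are hypothetical, the default prefix and suffix are only illustrative, and the input is assumed to contain nothing but well-formed blocks (atom count, comment line, coordinates).

```python
def read_geometries(lines):
    """Yield one list of coordinate lines per geometry in a multi-XYZ file."""
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between blocks
            continue
        n_atoms = int(lines[i].split()[0])
        # lines[i + 1] is the comment (energy) line and is skipped.
        yield lines[i + 2:i + 2 + n_atoms]
        i += 2 + n_atoms


def fill_template(template, coords, number):
    """Substitute both placeholders in the template text."""
    return (template
            .replace("CONF_NUMBER", str(number))
            .replace("COORDINATES", "\n".join(coords)))


def write_inputs(xyz_path, template_path, prefix="molecule-c", suffix=".inp"):
    """Write one input file per geometry and return the file names."""
    with open(xyz_path) as f:
        lines = f.read().splitlines()
    with open(template_path) as f:
        template = f.read()
    names = []
    for number, coords in enumerate(read_geometries(lines), start=1):
        name = f"{prefix}{number}{suffix}"
        with open(name, "w") as out:
            out.write(fill_template(template, coords, number))
        names.append(name)
    return names
```

Reading the atom count from each block header, rather than once per file, keeps the loop correct even if a file mixes molecules of different sizes.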
I found this script on the server. Could you please explain what the options below are for? What is the reason for using N, P, h and d?
sed '/Text1/{N;N;N;N;N;N;N;N;N;N;N;N;N;P;h;
s/.*\(Text2\)/\1/;P;g;
s/.*\(Text3\)/\1/;P;g;
s/.*\(Text4\)/\1/;P;g;
s/.*\(Text5\)/\1/;P;g;
s/.*\(Text6\)/\1/;P;g;
s/.*\(Text7\)/\1/;P;g;
s/.*\(Text8\)/\1/;P;g;
s/.*\(Text9\)/\1/;P;g;
s/.*\(Text10\)/\1/;P;g;
s/.*\(Text11\)/\1/;P;g;
s/.*\(Text12\)/\1/;P;g;
s/.*\(Text13\)/\1/;P;g;
s/.*\(Text14\)/\1/;P;d;}'
I want to append one column (column 7) from one file (motherfile) as the last column of many files (child1.c, child2.c, child3.c and so on).
motherfile
38 WAT1 1 TIP3 OH2 OT -0.834000 15.9994 0
39 WAT1 1 TIP3 H1 HT 0.417000 1.0080 0
40 WAT1 1 TIP3 H2 HT 0.417000 1.0080 0
41 WAT1 2 TIP3 OH2 OT -0.834000 15.9994 0
42 WAT1 2 TIP3 H1 HT 0.417000 1.0080 0
child1.c
O -5.689000 -0.628000 -10.423000
H -6.663000 -0.744000 -10.224000
H -5.166000 -1.340000 -9.957000
O 11.405000 3.612000 1.674000
H 11.331000 4.609000 1.663000
child2.c
O -4.689000 -0.628000 -10.423000
H -5.663000 -0.744000 -10.224000
H -6.166000 -1.340000 -9.957000
O 1.4405000 3.612000 1.674000
H 14.331000 4.609000 1.663000
and so on ...
I tried to use
awk '{f1 = $0; getline<"motherfile"; print f1, $7}' < child1.c > newchild1.c
but this only adds the column to one file, and I want to add it to many files.
Note that newchild1.c needs to look like this:
O -5.689000 -0.628000 -10.423000 -0.834000
H -6.663000 -0.744000 -10.224000 0.417000
H -5.166000 -1.340000 -9.957000 0.417000
O 11.405000 3.612000 1.674000 -0.834000
H 11.331000 4.609000 1.663000 0.417000
In awk, print statements can be redirected to a file using > or >>. The following example reads column 7 of the motherfile into memory and, for each child file, writes a new file whose name is the original name prefixed with the string new, with the saved column appended to every line.
awk 'NR==FNR{a[FNR]=$7;next}{print $0,a[FNR]>("new" FILENAME)}' motherfile child1.c child2.c ...
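The same idea (read the column once, then reuse it for every child file) can be sketched in Python. This is a hedged sketch: the function names are invented for illustration, and it assumes whitespace-separated columns and child files in the current directory, just as the awk one-liner does.

```python
def read_column(path, index):
    """Return the given 0-based column from a whitespace-separated file."""
    with open(path) as f:
        return [line.split()[index] for line in f if line.strip()]


def append_column(values, child_lines):
    """Pair each child line with the value on the same row number."""
    # zip stops at the shorter input, so extra rows on either side are dropped.
    return [f"{line.rstrip()} {value}" for line, value in zip(child_lines, values)]


def process(mother, children, prefix="new"):
    """Write prefix + name for every child file, with column 7 appended."""
    charges = read_column(mother, 6)  # column 7 is index 6
    for child in children:
        with open(child) as f:
            lines = f.read().splitlines()
        with open(prefix + child, "w") as out:
            out.write("\n".join(append_column(charges, lines)) + "\n")
```

A call such as process("motherfile", ["child1.c", "child2.c"]) would then produce newchild1.c and newchild2.c.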
Let's say my grammar is:
file = line, {line}
line = ..., "\n"
If I want to build a LL parser for that grammar, how should I implement the "one or more line"?
I was thinking about changing the grammar to this:
file = line
line = ..., "\n", nl
nl = line
| <end of file>
My lines would be nested. Is this the most elegant/efficient way to solve the problem ?
Close. Typically just like this:
file = line, morelines
morelines = e | line, morelines
line = ..., "\n"
Where e is the epsilon or empty symbol
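To make the shape of the resulting parser concrete, here is a minimal recursive-descent sketch in Python with one function per nonterminal. It is an illustration under assumptions: the "..." part of line is taken to be any run of characters other than a newline, and ParseError and the parse_* names are invented for this example.

```python
class ParseError(Exception):
    pass


def parse_line(text, pos):
    # line = ..., newline: consume up to and including the next newline.
    end = text.find("\n", pos)
    if end == -1:
        raise ParseError(f"missing newline after position {pos}")
    return text[pos:end], end + 1


def parse_morelines(text, pos, acc):
    # morelines = e | line, morelines
    # One symbol of lookahead picks the production: end of input selects
    # the empty alternative, anything else selects line, morelines.
    if pos == len(text):
        return pos
    line, pos = parse_line(text, pos)
    acc.append(line)
    return parse_morelines(text, pos, acc)


def parse_file(text):
    # file = line, morelines
    acc = []
    line, pos = parse_line(text, 0)
    acc.append(line)
    parse_morelines(text, pos, acc)
    return acc
```

Since morelines only ever calls itself as its last step, a hand-written parser would often turn it into a while loop, which also avoids Python's recursion limit on long inputs.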
I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later call into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high level languages but not C++ or dealing with binary files. I have found a few things through searching stackoverflow but no luck yet in reading this file. Ideally this would be done through bash or python.
thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the file is just a simple byte array. Here is an example of how to read it in Python:
import struct
import sys

inputfile = sys.argv[1]
# Read the raw bytes; the file is a flat array of 4-byte floats.
with open(inputfile, "rb") as f:
    data = f.read()
s = struct.Struct("f")  # native byte order, 4-byte float
l = len(data)
print("%d bytes" % l)
n = l // 4
for i in range(n):
    # Vertex i is stored at byte offset i * 4.
    x = s.unpack_from(data, i * 4)[0]
    print("%d %f" % (i, x))
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac or Linux, the following command prints the 4B.vout data one line per node, with the integer values the same as given in the summary file. If your file is called, e.g., filename.4B.vout, then some command-line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,="\n"; print unpack("L*",<>),"";'
Edited to add: this is for the assignments of connected component ID and community ID, written implicitly the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1 etc. But I am copypasting here so I'm not sure how it would need to change for floats. It works great for the integer values per node.
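For the float case, here is a minimal Python sketch of how the same bytes could be decoded either way. It rests on an assumption: the file is taken to be little-endian, which is what the hexdump in the question suggests. The function names read_floats and read_uint32 are made up here.

```python
import struct


def read_floats(raw):
    """Decode a bytes object as consecutive little-endian 4-byte floats."""
    count = len(raw) // 4
    return list(struct.unpack("<%df" % count, raw[:count * 4]))


def read_uint32(raw):
    """Decode the same layout as unsigned 32-bit integers instead."""
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[:count * 4]))


if __name__ == "__main__":
    # First 8 bytes of the hexdump in the question. hexdump prints 16-bit
    # words, so "999a 3e19" corresponds to bytes 9a 99 19 3e on a
    # little-endian machine.
    sample = bytes.fromhex("9a99193e" "68747f3e")
    print(read_floats(sample))
```

The first value decodes to roughly 0.15. The only difference between the two readers is the format character: f for floats, I for the integer IDs.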