Getting blocks of strings between specific characters in a text file - bash

Can someone help me find a way to copy all the strings between the string 'X 0' (X = H, He, ...) and the nearest '****'? I use bash for scripting.
H 0
S 3 1.00 0.000000000000
0.1873113696D+02 0.3349460434D 01
0.2825394365D+01 0.2347269535D+00
0.6401216923D+00 0.8137573261D+00
S 1 1.00 0.000000000000
0.1612777588D+00 0.1000000000D+01
****
He 0
S 3 1.00 0.000000000000
0.3842163400D+02 0.4013973935D 01
0.5778030000D+01 0.2612460970D+00
0.1241774000D+01 0.7931846246D+00
S 1 1.00 0.000000000000
0.2979640000D+00 0.1000000000D+01
****
I want to do this specifically for every "X 0" (X = H, He, ...), obtaining an isolated block of text like the following for each "X 0":
H 0
S 3 1.00 0.000000000000
0.1873113696D+02 0.3349460434D 01
0.2825394365D+01 0.2347269535D+00
0.6401216923D+00 0.8137573261D+00
S 1 1.00 0.000000000000
0.1612777588D+00 0.1000000000D+01
****
and
He 0
S 3 1.00 0.000000000000
0.3842163400D+02 0.4013973935D 01
0.5778030000D+01 0.2612460970D+00
0.1241774000D+01 0.7931846246D+00
S 1 1.00 0.000000000000
0.2979640000D+00 0.1000000000D+01
****
So I think I have to find a way to do it using the string containing "X 0".
I was trying to use grep -A2000 'H 0' filename.txt | grep -B2000 -m8 '****' filename.txt >> filenameH.txt, but it's not very useful for the other examples of X, only for the first.

Using awk:
awk '/^[^ ]+ 0$/{p=1;++c}/^\*\*\*\*$/{print >> (FILENAME c);p=0}p{print >> (FILENAME c)}' file
The script creates as many files as there are blocks matching the patterns /^[^ ]+ 0$/ and /^\*\*\*\*$/. The file index starts at 1.
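For example, assuming the sample above is saved as file, a run of the script leaves each block in a numbered file (the input file name plus the block counter):
$ head -n 1 file1 file2
==> file1 <==
H 0

==> file2 <==
He 0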

If the records are separated by four stars, you can also treat each block as a single record. This needs gawk (for a regex record separator and the RT variable):
$ awk -v RS='\\*\\*\\*\\*\n' '$1~/^He?$/{printf "%s", $0 RT > (FILENAME $1)}' file
This will only extract the H and He records (the condition is equivalent to $1=="H" || $1=="He"). If you don't want to restrict the output, just remove the condition before the opening brace.
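For example, with the sample saved as file, running the one-liner leaves each record in an output file named after the input plus the element symbol from the first field:
$ ls
file  fileH  fileHe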

Related

how to use sed to replace a specific line/lines in a file with the contents of another file

I want to use the sed command to replace several lines in one of my files with the contents (which consist of the same number of lines) of another file located in another folder.
For example: file1.txt is in /storage/file folder, and it looks like this:
'ABC'
'EFG' 001
HJK
file2.txt is located in the /storage folder, and it looks like this:
'kkk' 123456789
yyy
so I want to use the content of file2.txt to replace the 2nd and 3rd lines of file1.txt, and file1.txt should become like this:
'ABC'
'kkk' 123456789
yyy
I should probably make my question clearer. I'm trying to write a shell script that can replace several lines of a file (let's call it old.txt) with new content that I supply in other files (which contain only the content to be written into the old file; for example, dataA.txt, dataB.txt, ...).
Let's say I want to replace the 3rd line of old.txt, which is:
'TIME_STEPS' 'TIME CYCLE' 'ELAPSED' 100 77760 0 1.e+99 1. 9999 1. 1.e-20 1.e+99
with the new data that I supplied in dataA.txt which is:
'TIME_STEPS' 'TIME CYCLE' 'ELAPSED' 500 8520 0 1.e+99 1. 9999 1. 1.e-20 1.e+99
and to replace the 15th to 18th lines of the old.txt file which looks like:
100 0 1
101 1 2
102 2 1.5
103 4 52
with the supplied dataB.txt file, which looks like this (it also contains 4 lines):
-100
-101
-102
-103
As I'm totally new to shell scripting and have only used sed before, I tried the following:
To change the 3rd line, I did sed -i '3c r ../../dataA.txt' old.txt, where r ../../dataA.txt is meant to point to the location of dataA.txt. However, c needs to be followed by the replacement content itself rather than the path of a file containing that content, so I'm not sure how to use sed correctly here. Another idea I had is to insert dataA.txt, dataB.txt, ... in front of the lines I want to modify and then delete the old lines, but I'm still not sure how to do that even after googling for a long time...
To replace a range of lines with the entire contents of another file:
sed -e '15r file2' -e '15,18d' file1
To replace a single line with the entire contents of another file:
sed -e '2{r file2' -e 'd}' file1
If you don't know whether file2 ends with a newline or not, you can use the trick below (see What does this mean in Linux sed '$a\' a.txt):
sed '$ a\' file2 | sed -e '3{r /dev/stdin' -e 'd}' file1
The main trick is to use the r command to add the contents of the other file at the starting line address, and then delete the line(s) to be replaced. The separate -e options are needed because everything after r is treated as a filename.
Note that these have been tested with GNU sed; I'm not sure whether the behaviour varies in other implementations.
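For example, with the question's file1.txt and file2.txt in the current directory, replacing lines 2 and 3 gives the desired result:
$ sed -e '2r file2.txt' -e '2,3d' file1.txt
'ABC'
'kkk' 123456789
yyy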
See my github repo for more examples, such as matching lines based on regex instead of line numbers.
It is trivial with ed:
printf '%s\n' '2,$d' 'r /storage/file2.txt' ,p Q | ed -s /storage/file/file1.txt
A syntax that should work with a wider variety of Unix shells:
printf '2,$d\nr /storage/file2.txt\n,p\nQ\n' | ed -s /storage/file/file1.txt
In 2,$d, the 2 and $ are line addresses (2 is line 2 and $ is the last line in the buffer), and d means delete.
,p prints everything to stdout, which is your screen.
Q quits unconditionally, silencing the warning about unsaved changes that q would give.
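For reference, either printf invocation above just hands ed this four-line script on standard input:
2,$d
r /storage/file2.txt
,p
Q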
With ed, to replace line 3 of a file with the content of another file, without using shell variables:
First delete the content of line 3 of the file.
printf '%s\n' '3d' ,p Q | ed -s file1.txt
Then add the content of the other file, say file2.txt, at line 3.
printf '%s\n' '2r file2.txt' ,p Q | ed -s file1.txt
To replace a group/set of lines in a file with the content of another file:
First delete the lines, say 15 to 18, from file1.txt.
printf '%s\n' '15,18d' ,p Q | ed -s file1.txt
Then add the content of, say, file2.txt at line 15 of file1.txt.
printf '%s\n' '14r file2.txt' ,p Q | ed -s file1.txt
The Q does not write anything; replace it with w to actually edit the files.
The r command appends, so 14r means append the content of the other file after line 14, which makes it start at line 15. The same goes for 2r: it appends after line 2, so the new content starts at line 3.
All of that can also be done in one line; the code below is adapted to your data/file names. It assumes that all the text files are in the directory where you run it; otherwise, add the absolute paths of the files in question.
printf '%s\n' '3d' '2r dataA.txt' '15,18d' '14r dataB.txt' ,n Q | ed -s old.txt
Replace the Q with w if you're satisfied with the output and want to actually edit old.txt.
The ,n command prints everything to stdout (your screen), but with a line number at the front of each line.
To see what is actually being piped to ed, remove or comment out the pipe | and everything after it.
See info ed or man ed for more information about ed.
An example of that ed script.
Create a new directory and cd into it.
mkdir temp && cd temp
cat dataA.txt
Output
'TIME_STEPS' 'TIME CYCLE' 'ELAPSED' 500 8520 0 1.e+99 1. 9999 1. 1.e-20 1.e+99
cat dataB.txt
Output
-100
-101
-102
-103
cat old.txt
Output
foo
bar
'TIME_STEPS' 'TIME CYCLE' 'ELAPSED' 100 77760 0 1.e+99 1. 9999 1. 1.e-20 1.e+99
a
b
c
d
e
f
g
h
i
j
k
100 0 1
101 1 2
102 2 1.5
103 4 52
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
The script.
printf '%s\n' '3d' '2r dataA.txt' '15,18d' '14r dataB.txt' ,n w | ed -s old.txt
Output
1 foo
2 bar
3 'TIME_STEPS' 'TIME CYCLE' 'ELAPSED' 500 8520 0 1.e+99 1. 9999 1. 1.e-20 1.e+99
4 a
5 b
6 c
7 d
8 e
9 f
10 g
11 h
12 i
13 j
14 k
15 -100
16 -101
17 -102
18 -103
19 l
20 m
21 n
22 o
23 p
24 q
25 r
26 s
27 t
28 u
29 v
30 w
31 x
32 y
33 z
The actual old.txt after the edit:
cat old.txt
Output
foo
bar
'TIME_STEPS' 'TIME CYCLE' 'ELAPSED' 500 8520 0 1.e+99 1. 9999 1. 1.e-20 1.e+99
a
b
c
d
e
f
g
h
i
j
k
-100
-101
-102
-103
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z

Substitute word in a specific line with multiple keyword check

I need to modify a lot of .pdb files for work, and I need to script this operation so I don't waste time doing it manually every time.
I have a file with this particular format (this is an extract from the file; you can see the full file here):
ATOM 5210 C4 G B 96 10.157 -47.431 -42.832 1.00 43.97 C
ATOM 5211 P G B 97 11.305 -41.644 -44.835 1.00 26.64 P
ATOM 5212 OP1 A B 97 12.654 -41.242 -44.460 1.00 26.64 O
ATOM 5213 OP2 A B 97 10.167 -41.192 -44.014 1.00 26.64 O
ATOM 5214 O5' A B 97 11.079 -41.206 -46.340 1.00 26.64 O
In particular, for each file I need to substitute the word 'OP1' in the third column with another keyword, but ONLY if the first column displays 'ATOM' and there is a particular number in the sixth column.
I tried to script it with sed but didn't get any decent result.
I hope someone can help.
Thanks
A simple way to start (improved):
awk '{if ($1=="ATOM" && $6=="410" && $3=="OP2")sub($3,"XXX"); print }' 1X8W.pdb
Try this script:
while read -r p; do
    value1=`echo $p | cut -d' ' -f1`
    value2=`echo $p | cut -d' ' -f3`
    value3=`echo $p | cut -d' ' -f6`
    if [ "$value1" == "ATOM" ] && [ "$value3" == 97 ]; then
        if [ "$value2" == "OP1" ]; then
            echo $p | awk '{gsub("OP1", "newtext", $0); print}'
        fi
    fi
done < 1X8W.pdb
Replace newtext with the text you want substituted for OP1. Also, change the $value3 comparison value from 97 if you are checking for a different number.
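Run against the five sample lines above (assuming they are saved as 1X8W.pdb with single-space separators, exactly as shown), the script would print just the one line that passes all three checks:
ATOM 5212 newtext A B 97 12.654 -41.242 -44.460 1.00 26.64 O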
Sounds like this might be what you're trying to do:
awk '($1=="ATOM") && ($6==97) { sub(/^OP1$/,"other",$3); print }' file
but without more details in your question we're all just guessing and can't test it.

Replace a value in a file with another one (bash/awk)

I have a file (a coordinates file, for those who know what that is) like the following:
1 C 1
2 C 1 1 1.60000
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
and so on. My idea is to replace the value "1.60000" in the second line with other values, using a for loop.
I would like the value to start at, let's say, 0 and stop at 2.0, for example, with an increment step of 0.05.
Here is what I already tried:
#! /bin/bash
a=0;
for ((i=0; i<=10 (for example); i++)); do
awk '{if ((NR==2) && ($5=="1.60000")) {($5=a)} print $0 }' file.dat > ${i}_file.dat
a=$((a+0.05))
done
But unfortunately it doesn't work. I tried a lot of combinations for the {$5=a} statement, but without conclusive results.
Here is what I obtained:
1 C 1
2 C 1 1
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
The value 1.60000 simply disappears, or at least is replaced by a blank.
Any advice ?
Thanks a lot,
Pierre-Louis
For this, perhaps sed is a better alternative:
$ v=0.00; for ((i=0; i<=40; i++)); do
sed '2s/1.60/'"$v"'/' file > file_"$i";
v=$(echo "$v + 0.05" | bc | xargs printf "%.2f\n");
done
Explanation
sed '2s/1.60/'"$v"'/' file changes the value 1.60 on the second line to the value of variable v.
Floating-point arithmetic in bash is hard; this adds 0.05 to the value and formats it (0.05 instead of .05) so that it can be used in the sed substitution.
Exercise for you: in bash, try to add 0.05 to 0.05 and format the output as 0.10, with the leading zero.
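To see what that formatting step works around, compare bc's raw output with the printf-formatted one (a minimal illustration):
$ echo "0.05 + 0.05" | bc
.10
$ echo "0.05 + 0.05" | bc | xargs printf "%.2f\n"
0.10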
Example with awk (glenn's suggestion):
for ((i=0; i<=10; i++)); do
awk -v i="$i" '
FNR==2 { $5 = sprintf("%2.1f", i*0.5) }
{ print }
' file.dat # > "${i}_file.dat"   # uncomment for file output
done
Advantage: awk handles the floating-point arithmetic.

extract file names from a folder based on conditions

I have a folder which has files with the following contents.
ATOM 9 CE1 PHE A 1 70.635 -26.989 98.805 1.00 39.17 C
ATOM 10 CE2 PHE A 1 69.915 -26.416 100.989 1.00 42.21 C
ATOM 11 CZ PHE A 1 -69.816 26.271 -99.622 1.00 40.62 C
ATOM 12 N PRO A 2 -69.795 30.848 101.863 1.00 44.44 N
In some files, the 7th column appears as follows.
ATOM 9 CE1 PHE A 1 70.635-26.989 98.805 1.00 39.17 C
ATOM 10 CE2 PHE A 1 69.915-26.416 100.989 1.00 42.21 C
ATOM 11 CZ PHE A 1 -69.816-26.271 -99.622 1.00 40.62 C
ATOM 12 N PRO A 2 -69.795-30.848 101.863 1.00 44.44 N
I would like to extract the names of the files which have the above type of lines. What is an easy way to do this?
Referring to Erik E. Lorenz's answer, you can simply do:
grep -l '\s-\?[0-9.]\+-[0-9.]\+\s' dir/*
From the grep manpage:
-l
(The letter ell.) Write only the names of files containing selected
lines to standard output. Pathnames are written once per file searched.
If the standard input is searched, a pathname of (standard input) will
be written, in the POSIX locale. In other locales, standard input may be
replaced by something more appropriate in those locales.
A combination of grep and cut works for me:
grep -H -m 1 '\s-\?[0-9.]\+-[0-9.]\+\s' dir/* | cut -d: -f1
This performs the following steps:
for every file in dir/*, find the first match (-m 1) of two adjacent numbers separated by only a dash
print it with the filename prepended (-H). Should be the default anyway.
extract the file name using cut
This is fast since it only looks at the first matching line. If there are other places with two adjacent numbers, consider changing the regex.
Edit:
This doesn't match scientific notation and may falsely report content such as '.-.', for example in comments. If you're dealing with either of those, you have to expand the regex.
awk 'NF > 10 && $1 ~ /^[[:upper:]]+$/ && $2 ~ /^[[:digit:]]+/ { print FILENAME; nextfile }' *
This will print the names of files that contain a line with more than 10 fields whose first field is all uppercase letters and whose second field starts with digits.
Using GNU awk for nextfile:
awk '$7 ~ /[0-9]-[0-9]/{print FILENAME; nextfile}' *
Or, more efficiently (since you only need to test the first line of each file if all lines in a given file have the same format):
awk 'FNR==1{if ($7 ~ /[0-9]-[0-9]/) print FILENAME; nextfile}' *

For loop and value changing using awk

I have the following file format
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
What I am trying to do is locate a specific value (in particular, the value after the sequence SI1 L) and create a series of files with different values, for instance SI1 L 0.020 ---> SI1 L 0.050. What I have in mind is to give a start value, an end value and a step, so as to generate files with different values after the sequence SI1 L. For instance, a for loop would work, but I don't know how to use one outside awk.
I am able to locate the value using
awk '$1=="SI1" {printf "%12s\n", $3}' file
I could also use the following to replace the value
awk '$1=="SI1" {gsub(/0.020/, "0.050"); printf "%12s\n", $3}' file
The thing is that the value won't always be 0.020. That's why I need a way to replace the value after the sequence SI1 L, and this replacement should be done for many values.
How can this be achieved?
You can try:
awk -vval="0.05" '$1=="SI1"{$3=val}1' file
This will replace SI1 L 0.020 with SI1 L 0.05 in the output.
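A quick check against the sample input (assuming it is saved as file) shows only the matching line being altered:
$ awk -vval="0.05" '$1=="SI1"{$3=val}1' file | grep '^SI1'
SI1 L 0.05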
Then use a bash script to call the awk program in a for loop.
For instance:
#! /bin/bash
vals=(0.02 0.03 0.04 0.05)
i=0
for val in "${vals[@]}"; do
i=$(($i+1))
awk -vval="$val" '$1=="SI1"{$3=val}1' file > "file${i}"
done
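With the sample input saved as file, this produces file1 through file4, one per entry in vals; a quick look at the generated files might give:
$ grep '^SI1' file1 file2 file3 file4
file1:SI1 L 0.02
file2:SI1 L 0.03
file3:SI1 L 0.04
file4:SI1 L 0.05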
If your system has the seq command, here is an easier script for you.
for val in $(seq 0.02 0.01 0.05)
do
awk -vval="$val" '/SI1 L/{$3=val}1' file > "${val}"
# or, using sed:
# sed "s/SI1 L .*/SI1 L $val/" file > "${val}"
done
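For reference, the seq call generates the values that serve both as the replacement and as the output file names (GNU seq shown):
$ seq 0.02 0.01 0.05
0.02
0.03
0.04
0.05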
