Removing lines depending upon keyword occurrence - bash

I have 7,000 files (sade1.pdbqt ... sade7200.pdbqt). Only some of them contain a second (or later) occurrence of the keyword TORSDOF. For any file that has a second occurrence, I want to remove every line after the first occurrence, keeping the file name unchanged. Can somebody please provide a sample snippet? Thank you.
$ cat FileWith2ndOccurance.txt
ashu
vishu
jyoti
TORSDOF
Jatin
Vishal
Shivani
TORSDOF
Sushil
Kiran
After the function runs:
$ cat FileWith2ndOccurance.txt
ashu
vishu
jyoti
TORSDOF
EDIT1: A copy of an actual file:
REMARK Name = 17-DMAG.cdx
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: C_1 and N_8
REMARK 2 A between atoms: N_8 and C_9
REMARK 3 A between atoms: C_9 and C_10
REMARK 4 A between atoms: C_10 and N_11
REMARK 5 A between atoms: C_15 and O_17
REMARK 6 A between atoms: C_25 and O_28
REMARK 7 A between atoms: C_27 and O_33
REMARK 8 A between atoms: O_28 and C_29
REMARK x y z vdW Elec q Type
REMARK _______ _______ _______ _____ _____ ______ ____
ROOT
ATOM 1 C UNL 1 7.579 11.905 0.000 0.00 0.00 +0.000 C
ATOM 2 C UNL 1 7.579 10.500 0.000 0.00 0.00 +0.000 C
ATOM 30 O UNL 1 8.796 8.398 0.000 0.00 0.00 +0.000 OA
ENDROOT
BRANCH 21 31
ATOM 31 O UNL 1 13.701 7.068 0.000 0.00 0.00 +0.000 OA
ATOM 32 C UNL 1 12.306 6.953 0.000 0.00 0.00 +0.000 C
ENDBRANCH 41 42
ENDBRANCH 19 41
TORSDOF 8
REMARK Name = 17-DMAG.cdx
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: C_1 and N_8
REMARK 2 A between atoms: N_8 and C_9
REMARK x y z vdW Elec q Type
REMARK _______ _______ _______ _____ _____ ______ ____
ROOT
ATOM 1 CL UNL 1 0.000 11.656 0.000 0.00 0.00 +0.000 Cl
ENDROOT
TORSDOF 0

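What I would do:
#!/bin/bash
for file in sade*.pdbqt; do
    # count the lines that start with TORSDOF
    count=$(grep -c '^TORSDOF' "$file")
    if ((count>1)); then
        # keep everything up to and including the first TORSDOF line
        awk '/^TORSDOF/{print;exit}1' "$file" > /tmp/.torsdof &&
            mv /tmp/.torsdof "$file"
    fi
done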
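To try this safely before touching the real files, the same loop can be run on throwaway data. The sketch below is only an illustration: the two sample files and their contents are made up, the bash-only ((...)) test is written as a portable [ ... ] test, and the fixed /tmp/.torsdof name is swapped for mktemp (assuming mktemp is available) so that two runs cannot overwrite each other's temporary file.

```shell
#!/bin/sh
# Scratch directory with made-up sample files (not real pdbqt data).
workdir=$(mktemp -d) || exit 1
printf '%s\n' ashu vishu TORSDOF Jatin TORSDOF Sushil > "$workdir/sade1.pdbqt"
printf '%s\n' ashu TORSDOF > "$workdir/sade2.pdbqt"

for file in "$workdir"/sade*.pdbqt; do
    # Count the lines that start with the keyword.
    count=$(grep -c '^TORSDOF' "$file")
    if [ "$count" -gt 1 ]; then
        tmp=$(mktemp) || exit 1
        # Print up to and including the first TORSDOF line, then stop.
        awk '/^TORSDOF/{print; exit} 1' "$file" > "$tmp" && mv "$tmp" "$file"
    fi
done

trimmed=$(cat "$workdir/sade1.pdbqt")    # had two TORSDOF lines
untouched=$(cat "$workdir/sade2.pdbqt")  # had a single TORSDOF line
printf '%s\n' "$trimmed"
```

Files with a single TORSDOF are skipped entirely, so their timestamps stay as they were. Note that mv replaces the original with the temporary file, so the result carries the temporary file's permissions.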

Related

How do I return a varying number as a variable in a string found in another file that otherwise stays constant (BASH)?

I have a file containing text like this (only a portion is shown) and want to find the ATOM number associated with the O5' line (in this case "2"), then store that number in a variable for later use. The data below is stored in a separate file, called "xyz.file" for example. The number of spaces between "ATOM" and the number may vary as the number gets wider.
ATOM 1 HO5' G5 1 7.415 -9.123 -8.109 1.00 0.00
ATOM 2 O5' G5 1 7.997 -8.960 -8.863 1.00 0.00
ATOM 3 C5' G5 1 9.136 -9.784 -8.729 1.00 0.00
ATOM 4 H5' G5 1 9.679 -9.808 -9.673 1.00 0.00
ATOM 5 H5'' G5 1 8.814 -10.797 -8.484 1.00 0.00
ATOM 6 C4' G5 1 10.067 -9.272 -7.628 1.00 0.00
ATOM 7 H4' G5 1 10.847 -10.015 -7.448 1.00 0.00
ATOM 8 O4' G5 1 10.700 -8.053 -7.990 1.00 0.00
ATOM 9 C1' G5 1 10.866 -7.262 -6.821 1.00 0.00
ATOM 10 H1' G5 1 11.907 -6.970 -6.696 1.00 0.00
ATOM 11 N9 G5 1 10.027 -6.048 -6.896 1.00 0.00
An awk one-liner:
n=$(awk '$3 == "O5'\''" {print $2; exit}' file)
echo "$n"
prints
2
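The '\'' sequence in that one-liner is easy to mistype. A possible alternative, shown here only as a sketch, passes the atom name in with awk's -v option so the awk program itself contains no apostrophe. The variable name "name" and the temporary sample file are assumptions made for this illustration.

```shell
#!/bin/sh
# Made-up excerpt of the data file, written to a temporary file.
sample=$(mktemp) || exit 1
cat > "$sample" <<'EOF'
ATOM 1 HO5' G5 1 7.415 -9.123 -8.109 1.00 0.00
ATOM 2 O5' G5 1 7.997 -8.960 -8.863 1.00 0.00
ATOM 3 C5' G5 1 9.136 -9.784 -8.729 1.00 0.00
EOF

# The shell handles the apostrophe inside double quotes; awk only sees a variable.
n=$(awk -v name="O5'" '$3 == name {print $2; exit}' "$sample")
echo "$n"
rm -f "$sample"
```

Because awk splits fields on runs of whitespace by default, the varying number of spaces after ATOM does not affect the field numbers.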

bash: cat + grep to produce several replicas merging two files

Using the Linux bash command line, I need to merge two files by inserting several copies of file 1 into a specified part of file 2. File 1 looks like:
ATOM 1 N SER A 1 -2.390 4.343 -17.003 1.00 27.76 N1+
ATOM 2 CA SER A 1 -2.066 5.647 -16.370 1.00 27.12 C
ATOM 3 C SER A 1 -2.394 5.608 -14.874 1.00 26.29 C
ATOM 4 O SER A 1 -3.014 4.627 -14.405 1.00 22.93 O
ATOM 5 CB SER A 1 -2.771 6.798 -17.057 1.00 28.10 C
ATOM 6 OG SER A 1 -2.538 8.023 -16.373 1.00 32.02 O
ATOM 7 N GLY A 2 -1.982 6.655 -14.162 1.00 25.31 N
ATOM 8 CA GLY A 2 -2.172 6.779 -12.716 1.00 24.93 C
ATOM 9 C GLY A 2 -0.888 6.336 -12.067 1.00 23.66 C
ATOM 10 O GLY A 2 -0.168 5.459 -12.608 1.00 27.42 O
ATOM 11 N PHE A 3 -0.636 6.866 -10.900 1.00 22.07 N
ATOM 12 CA PHE A 3 0.622 6.595 -10.191 1.00 21.70 C
ATOM 13 C PHE A 3 0.279 6.570 -8.716 1.00 20.39 C
ATOM 14 O PHE A 3 -0.265 7.544 -8.167 1.00 23.83 O
File 2 is a multi-block file whose parts are introduced by MODEL 1, MODEL 2, ... MODEL N and terminated by ENDMDL:
MODEL 1
REMARK VINA RESULT: -7.828 0.000 0.000
REMARK INTER + INTRA: -13.769
REMARK INTER: -10.110
REMARK INTRA: -3.659
REMARK UNBOUND: -3.196
ENDMDL
MODEL 2
REMARK VINA RESULT: -7.828 0.000 0.000
REMARK INTER + INTRA: -13.769
REMARK INTER: -10.110
REMARK INTRA: -3.659
REMARK UNBOUND: -3.196
ENDMDL
MODEL 3
REMARK VINA RESULT: -7.828 0.000 0.000
REMARK INTER + INTRA: -13.769
REMARK INTER: -10.110
REMARK INTRA: -3.659
REMARK UNBOUND: -3.196
ENDMDL
I need to copy the entire content of file 1 into file 2 just before each ENDMDL separator, so that file 2 ends up containing several copies of file 1. Here is an example of the expected output:
MODEL 1
REMARK VINA RESULT: -7.828 0.000 0.000
REMARK INTER + INTRA: -13.769
REMARK INTER: -10.110
REMARK INTRA: -3.659
REMARK UNBOUND: -3.196
ATOM 1 N SER A 1 -2.390 4.343 -17.003 1.00 27.76 N1+
ATOM 2 CA SER A 1 -2.066 5.647 -16.370 1.00 27.12 C
ATOM 3 C SER A 1 -2.394 5.608 -14.874 1.00 26.29 C
ATOM 4 O SER A 1 -3.014 4.627 -14.405 1.00 22.93 O
ATOM 5 CB SER A 1 -2.771 6.798 -17.057 1.00 28.10 C
ATOM 6 OG SER A 1 -2.538 8.023 -16.373 1.00 32.02 O
ATOM 7 N GLY A 2 -1.982 6.655 -14.162 1.00 25.31 N
ATOM 8 CA GLY A 2 -2.172 6.779 -12.716 1.00 24.93 C
ATOM 9 C GLY A 2 -0.888 6.336 -12.067 1.00 23.66 C
ATOM 10 O GLY A 2 -0.168 5.459 -12.608 1.00 27.42 O
ATOM 11 N PHE A 3 -0.636 6.866 -10.900 1.00 22.07 N
ATOM 12 CA PHE A 3 0.622 6.595 -10.191 1.00 21.70 C
ATOM 13 C PHE A 3 0.279 6.570 -8.716 1.00 20.39 C
ATOM 14 O PHE A 3 -0.265 7.544 -8.167 1.00 23.83 O
ENDMDL
MODEL 2
REMARK VINA RESULT: -7.828 0.000 0.000
REMARK INTER + INTRA: -13.769
REMARK INTER: -10.110
REMARK INTRA: -3.659
REMARK UNBOUND: -3.196
ATOM 1 N SER A 1 -2.390 4.343 -17.003 1.00 27.76 N1+
ATOM 2 CA SER A 1 -2.066 5.647 -16.370 1.00 27.12 C
ATOM 3 C SER A 1 -2.394 5.608 -14.874 1.00 26.29 C
ATOM 4 O SER A 1 -3.014 4.627 -14.405 1.00 22.93 O
ATOM 5 CB SER A 1 -2.771 6.798 -17.057 1.00 28.10 C
ATOM 6 OG SER A 1 -2.538 8.023 -16.373 1.00 32.02 O
ATOM 7 N GLY A 2 -1.982 6.655 -14.162 1.00 25.31 N
ATOM 8 CA GLY A 2 -2.172 6.779 -12.716 1.00 24.93 C
ATOM 9 C GLY A 2 -0.888 6.336 -12.067 1.00 23.66 C
ATOM 10 O GLY A 2 -0.168 5.459 -12.608 1.00 27.42 O
ATOM 11 N PHE A 3 -0.636 6.866 -10.900 1.00 22.07 N
ATOM 12 CA PHE A 3 0.622 6.595 -10.191 1.00 21.70 C
ATOM 13 C PHE A 3 0.279 6.570 -8.716 1.00 20.39 C
ATOM 14 O PHE A 3 -0.265 7.544 -8.167 1.00 23.83 O
ENDMDL
MODEL 3
REMARK VINA RESULT: -7.828 0.000 0.000
REMARK INTER + INTRA: -13.769
REMARK INTER: -10.110
REMARK INTRA: -3.659
REMARK UNBOUND: -3.196
ATOM 1 N SER A 1 -2.390 4.343 -17.003 1.00 27.76 N1+
ATOM 2 CA SER A 1 -2.066 5.647 -16.370 1.00 27.12 C
ATOM 3 C SER A 1 -2.394 5.608 -14.874 1.00 26.29 C
ATOM 4 O SER A 1 -3.014 4.627 -14.405 1.00 22.93 O
ATOM 5 CB SER A 1 -2.771 6.798 -17.057 1.00 28.10 C
ATOM 6 OG SER A 1 -2.538 8.023 -16.373 1.00 32.02 O
ATOM 7 N GLY A 2 -1.982 6.655 -14.162 1.00 25.31 N
ATOM 8 CA GLY A 2 -2.172 6.779 -12.716 1.00 24.93 C
ATOM 9 C GLY A 2 -0.888 6.336 -12.067 1.00 23.66 C
ATOM 10 O GLY A 2 -0.168 5.459 -12.608 1.00 27.42 O
ATOM 11 N PHE A 3 -0.636 6.866 -10.900 1.00 22.07 N
ATOM 12 CA PHE A 3 0.622 6.595 -10.191 1.00 21.70 C
ATOM 13 C PHE A 3 0.279 6.570 -8.716 1.00 20.39 C
ATOM 14 O PHE A 3 -0.265 7.544 -8.167 1.00 23.83 O
ENDMDL
I have tried cat, but it just concatenated the two files without the required replication of the first file:
cat file1.pdb file2.pdb > together.pdb
Do I need to pipe this to some grep expression in order to replicate file 1 before each ENDMDL in file 2?
Here is an awk solution that doesn't call unsafe system or getline:
awk 'NR==FNR {s = s $0 ORS; next} $0 == "ENDMDL" {$0 = s $0} 1' file1 file2
If the file names are held in shell variables, use:
awk 'NR==FNR {s = s $0 ORS; next}
     $0 == "ENDMDL" {$0 = s $0} 1' "$file1" "$file2"
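For anyone who wants to see the two-file idiom in action, here is a small sketch on made-up miniature files (the contents and the temporary directory are assumptions, not the real PDB data). While NR==FNR, awk is still reading the first file and appends each line to s; for the second file, every ENDMDL line is prefixed with the saved text before being printed.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
# Tiny stand-ins for file1 (block to insert) and file2 (multi-model file).
printf 'ATOM 1\nATOM 2\n' > "$dir/file1.pdb"
printf 'MODEL 1\nREMARK a\nENDMDL\nMODEL 2\nREMARK b\nENDMDL\n' > "$dir/file2.pdb"

merged=$(awk 'NR==FNR {s = s $0 ORS; next}   # first file: collect its lines
              $0 == "ENDMDL" {$0 = s $0}     # second file: prepend before ENDMDL
              1' "$dir/file1.pdb" "$dir/file2.pdb")
printf '%s\n' "$merged"
```

One caveat of the NR==FNR test: if the first file is empty, the condition is also true for the second file, so nothing would be printed.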
Use awk.
awk '/^ENDMDL$/ {system("cat file1.pdb");}; {print}' file2.pdb
Each line from file2 is written to standard output, but when the line matches ENDMDL, the entire contents of file1 are output first.
Some alternatives:
Replace /^ENDMDL$/ with $0 == "ENDMDL".
Replace {print} with 1. (A rule with no pattern runs its action for every line. A rule with no action prints the current line, and 1 is a pattern that is always true.)
Here's a straightforward awk solution:
awk '
BEGIN {
    FS = RS = "\a"
    # read the whole of file1 (the block to insert) as a single record
    getline contents < ARGV[1]
    close(ARGV[1])
    # blank the argument so awk does not process file1 as regular input
    ARGV[1] = ""
    RS = "\n"
}
/^ENDMDL$/ { printf "%s", contents }
{ print }
' file1 file2
The script slurps the content of the file to be inserted (file1, the first argument) into a variable, then prints it each time ENDMDL appears in file2. I'm using the BELL character as FS and RS because you won't encounter it in a PDB file.

delete rows after specific character | awk

I am writing a Bash script and I need to remove all lines between the TER markers, including the TER lines themselves.
Input file:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER
Expected output:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
I found
sed '/TER/,$d' ${myArray[j]}.txt >> ${MyArray[j]}.txt ### ${MyArray[j]} file name through an array
But this does not work. I think awk would work in a Bash script. Any help is appreciated, thanks.
You can just use sed like this:
sed -i.bak '/^TER/,/^TER/d' "${myArray[j]}.txt"
cat "${myArray[j]}.txt"
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
sed '/TER/,/TER/d'
For example:
echo "ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER" | sed '/TER/,/TER/d'
Output:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
The general form is:
sed '/Start Pattern/,/End Pattern/d'
It can be done like this:
sed '/TER/,$d' "${myArray[j]}.txt" > tmp.txt # note: only one ">"
mv tmp.txt "${myArray[j]}.txt"
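The two sed forms in these answers give the same result on the sample input, but they differ as soon as there is data after the second TER. The sketch below uses a made-up five-line input (the trailing "ATOM 900" line is an assumption added to show the difference): /^TER/,/^TER/d deletes only the block from one TER to the next, while /^TER/,$d deletes from the first TER to the end of the file.

```shell
#!/bin/sh
# Made-up input: one line before the TER block, one line after it.
input=$(printf '%s\n' 'ATOM 186' 'TER' 'ATOM 1' 'TER' 'ATOM 900')

# Range from a TER line to the next TER line, both included.
between=$(printf '%s\n' "$input" | sed '/^TER/,/^TER/d')

# Range from the first TER line to the last line of the input.
to_end=$(printf '%s\n' "$input" | sed '/^TER/,$d')

printf 'between:\n%s\nto_end:\n%s\n' "$between" "$to_end"
```

Which one is right depends on whether anything after the closing TER should survive. Also note that if a TER has no matching closing TER, the first form deletes through to the end of the input.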
awk also provides a simple solution using a flag to control printing. Below, the skip variable is used as a flag: while it is 1 the lines are skipped, and each TER line toggles it.
awk '$1=="TER"{skip=!skip; next} !skip' file
Above, $1=="TER" is used to match lines (records) where the first field is TER (this disambiguates between "TER" and "TERMINAL", etc.). Within the rule, skip=!skip sets skip to 1 the first time "TER" is encountered and back to 0 on the next, and next discards the TER line itself. The trailing !skip is a pattern with no action, so every line read while skip is 0 is printed.
Example Use/Output
Using your data in file, you would get:
$ awk '$1=="TER"{skip=!skip; next} !skip' file
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H

How to compare multiple columns in two files and retrieve the corresponding value from another column if match found

I have two files, File1.txt and File2.txt. I need to compare columns 1, 2 and 3 of File1 with columns 4, 5 and 6 of File2, respectively. Where they match, I want to retrieve the corresponding value from column 2 of File2 and write it to Output.txt. Sample files are below:
File1.txt
ASP B 276
ASN B 290
ALA B 294
ALA B 297
ARG B 298
ARG B 303
LYS D 288
File2.txt
ATOM 4770 N ALA A 346 71.417 37.005 4.562 1 0 N
ATOM 4778 C ALA A 346 72.003 34.855 3.476 1 0 C
ATOM 4779 O ALA A 346 72.956 34.2 3.103 1 0 O
ATOM 4859 N SER A 353 78.218 33.415 -2.595 1 0 N
ATOM 4867 HG SER A 353 78.828 31.548 0.899 1 0
ATOM 4868 C SER A 353 79.637 31.351 -2.619 1 0 C
ATOM 4869 O SER A 353 80.669 30.76 -2.372 1 0 O
ATOM 9771 N ASP B 238 52.86 30.061 -7.031 1 0 N
ATOM 9772 H ASP B 238 53.651 30.105 -7.641 1 0
ATOM 9776 HB1 ASP B 238 53.516 32.92 -5.486 1 0
ATOM 10352 H ASP B 276 11.565 35.255 6.968 1 0
ATOM 10356 HB1 ASP B 276 10.084 33.659 6.727 1 0
ATOM 10357 HB2 ASP B 276 10.331 32.059 6.945 1 0
ATOM 10358 CG ASP B 276 9.946 33.07 8.681 1 0 C
ATOM 10453 H ASN B 290 16.73 30.519 13.339 1 0
ATOM 10454 CA ASN B 290 18.755 31.013 13.763 1 0 C
ATOM 10458 HB2 ASN B 290 20.105 29.465 13.891 1 0
ATOM 10459 CG ASN B 290 18.471 28.842 14.99 1 0 C
ATOM 10460 OD1 ASN B 290 18.246 29.429 16.072 1 0 O
ATOM 10512 H ALA B 294 24.099 33.167 8.943 1 0
ATOM 10513 CA ALA B 294 26.095 33.794 9.273 1 0 C
ATOM 10514 HA ALA B 294 26.597 34.261 8.545 1 0
ATOM 10515 CB ALA B 294 25.515 34.817 10.199 1 0 C
ATOM 10556 H ALA B 297 28.288 31.299 7.752 1 0
ATOM 10557 CA ALA B 297 30.202 31.869 7.061 1 0 C
ATOM 10558 HA ALA B 297 30.566 31.457 6.226 1 0
ATOM 10566 H ARG B 298 30.012 32.059 9.568 1 0
ATOM 10567 CA ARG B 298 31.961 32.047 10.392 1 0 C
ATOM 10568 HA ARG B 298 32.532 32.853 10.237 1 0
ATOM 10569 CB ARG B 298 31.251 32.167 11.74 1 0 C
ATOM 10650 HE ARG B 303 36.405 23.564 2.394 1 0
ATOM 10651 CZ ARG B 303 34.807 22.582 3.07 1 0 C
ATOM 10652 NH1 ARG B 303 33.867 22.493 3.991 1 0 N
ATOM 10653 1HH1 ARG B 303 33.829 23.162 4.733 1 0
ATOM 10654 2HH1 ARG B 303 33.192 21.757 3.947 1 0
ATOM 10655 NH2 ARG B 303 34.847 21.706 2.081 1 0 N
ATOM 17143 OE1 GLU C 295 59.322 13.561 -6.631 1 0 O
ATOM 17144 OE2 GLU C 295 57.646 14.02 -7.941 1 0 O
ATOM 17145 C GLU C 295 54.718 13.527 -3.448 1 0 C
ATOM 17146 O GLU C 295 54.509 14.618 -2.982 1 0 O
ATOM 23627 HB1 LYS D 288 32.909 52.854 29.282 1 0
ATOM 23628 HB2 LYS D 288 31.41 53.372 29.672 1 0
ATOM 23629 CG LYS D 288 32.811 53.749 31.138 1 0 C
ATOM 23630 HG1 LYS D 288 32.137 53.82 31.873 1 0
ATOM 23631 HG2 LYS D 288 33.636 53.303 31.484 1 0
ATOM 23632 CD LYS D 288 33.168 55.146 30.656 1 0 C
Output.txt should contain the value of column 2 of the second file only where the three columns of File1 match columns 4, 5 and 6 of File2.
Output.txt
10352
10356
10357
10358
10453
10454
10458
10459
10460
10512
10513
10514
10515
10556
10557
10558
10566
10567
10568
10569
10650
10651
10652
10653
10654
10655
23627
23628
23629
23630
23631
23632
I tried the awk one-liner below. It ran, but it retrieved different values from column 2 of File2. I need help finding out where I am going wrong.
awk 'FNR==NR{a[$1,$2,$3]=$0;next}{if(b=a[$4,$5,$6]){print $2}}' File1.txt File2.txt > Output.txt
Thanks in advance.
Asha,
MBU, IISc,
Bangalore, India
You can use this awk command:
awk 'FNR==NR{a[$1,$2,$3]; next} ($4,$5,$6) in a{print $2}' file1 file2
10352
10356
10357
10358
10453
10454
10458
10459
10460
10512
10513
10514
10515
10556
10557
10558
10566
10567
10568
10569
10650
10651
10652
10653
10654
10655
23627
23628
23629
23630
23631
23632
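A reduced version of the same lookup may make the mechanics easier to follow. This is only a sketch on made-up excerpts of the two files. The first pass stores each (residue, chain, number) triple as an array key; the second pass prints column 2 whenever columns 4 to 6 form a stored key. The sub() call that strips a trailing carriage return is an assumption added here, in case the files were saved with Windows line endings; it is not part of the answer above and does nothing on plain Unix files.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
printf 'ASP B 276\nLYS D 288\n' > "$dir/File1.txt"
cat > "$dir/File2.txt" <<'EOF'
ATOM 9771 N ASP B 238 52.86 30.061 -7.031 1 0 N
ATOM 10352 H ASP B 276 11.565 35.255 6.968 1 0
ATOM 23627 HB1 LYS D 288 32.909 52.854 29.282 1 0
EOF

ids=$(awk '{ sub(/\r$/, "") }                 # assumption: tolerate CRLF input
           FNR == NR { seen[$1,$2,$3]; next } # File1: remember each triple
           ($4,$5,$6) in seen { print $2 }    # File2: print the atom number
          ' "$dir/File1.txt" "$dir/File2.txt")
printf '%s\n' "$ids"
```

Using "in" tests for the key without creating it, whereas looking up seen[$4,$5,$6] directly would add an empty entry for every line of File2.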

output two commands on same line

I have seen this question a few times, but I cannot get the solutions to work.
I have the following command:
printf '%s\n' "${fa[@]}" | xargs -n 3 bash -c 'cat *-$2.ss | sed -n 11,1p ; echo $0 $1 $2;'
where
printf '%s\n' "${fa[@]}"
O00238 115 03
O00238 126 04
and cat *-$2.ss gives:
1 D C 0.999 0.000 0.000
2 L C 0.940 0.034 0.012
3 H C 0.971 0.005 0.015
4 P C 0.977 0.005 0.009
5 T C 0.970 0.009 0.018
6 L C 0.977 0.006 0.011
7 P C 0.864 0.027 0.014
8 P C 0.966 0.018 0.011
9 L C 0.920 0.038 0.039
10 K C 0.924 0.043 0.039
11 D C 0.935 0.036 0.035
12 R C 0.934 0.023 0.053
13 D C 0.932 0.022 0.046
14 F C 0.878 0.041 0.088
15 V C 0.805 0.031 0.198
16 D C 0.834 0.039 0.108
17 G C 0.882 0.019 0.071
18 P C 0.800 0.031 0.132
19 I C 0.893 0.039 0.070
20 H C 0.823 0.024 0.179
21 H C 0.920 0.026 0.070
22 R C 0.996 0.001 0.002
running the command then produces
11 D C 0.935 0.036 0.035
O00238 115 03
11 K C 0.449 0.252 0.270
O00238 126 04
In each pair, the first line is the output of sed -n 11,1p and the second is the output of echo $0 $1 $2.
How do I pair the output on the same line, i.e.
11 D C 0.935 0.036 0.035 O00238 115 03
11 K C 0.449 0.252 0.270 O00238 126 04
I have tried:
printf '%s\n' "${fa[@]}" | xargs -n 3 bash -c 'cat *-$2.ss | {sed -n 11,1p ; echo $0 $1 $2;} | tr "\n" " "'
as suggested in "Concatenate in bash the output of two commands without newline character";
however I get
O00238: -c: line 0: syntax error near unexpected token `}'
O00238: -c: line 0: `cat *-$2.ss | {sed -n 11,1p ; echo $0 $1 $2;} | tr "\n" " "'
What is the problem?
You could try using something like this:
i=0
for f in *-"$2".ss; do printf '%s %s\n' "$(sed -n '11p' "$f")" "${fa[$((i++))]}"; done
This loops through your files and prints the 11th line of each alongside one element of the array fa, whose index i increases by 1 on every iteration.
I could not reproduce your setup, but
printf "O00238 115 03\nO00238 126 04" | xargs -n 3 bash -c 'cat test.dat | sed -n 11,1p | tr -d "\n"; echo " $0 $1 $2"'
gives
11 D C 0.935 0.036 0.035 O00238 115 03
11 D C 0.935 0.036 0.035 O00238 126 04
which should work in your case. I just deleted the trailing newline of the sed output with tr -d "\n".
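As for the syntax error itself: in the shell, { is a reserved word and has to be followed by whitespace, so {sed is parsed as a command named "{sed" and the later } is unexpected. Writing { sed ...; echo ...; } would parse, but the final tr would still turn every newline into a space. Another way to get one line per input triple, sketched below, is to capture the sed output with command substitution, which drops its trailing newline. The file demo-03.ss and its contents are made up for this illustration.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
# Made-up stand-in for one *.ss file: twelve numbered lines.
i=1
while [ "$i" -le 12 ]; do
    printf '%s X C 0.9 0.0 0.0\n' "$i"
    i=$((i + 1))
done > "$dir/demo-03.ss"

# xargs passes the three words as $0, $1 and $2 of the inner shell.
paired=$(cd "$dir" && printf '%s\n' 'O00238 115 03' |
    xargs -n 3 sh -c 'printf "%s %s %s %s\n" "$(sed -n 11p ./*-"$2".ss)" "$0" "$1" "$2"')
printf '%s\n' "$paired"
```

This assumes exactly one file matches *-$2.ss for each triple; with several matches, sed would print the 11th line of the concatenated input only.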
