Add heteroatom to pdb file - bioinformatics

I am using Biopython to perform various operations on a pdb file. Afterwards I would like to add some new atoms to the structure object that Biopython generated. Is there a good/recommended way to do this in Python? It seems Biopython only provides options to write out existing elements of a pdb file and not to create new ones.

You could have a look at the Python package Biotite (https://www.biotite-python.org/), a package I am developing.
In the following example code, a PDB structure is downloaded, read and then an atom is added:
import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.io as strucio
# Download lysozyme structure for example
file_name = rcsb.fetch("1aki", "pdb", target_path=".")
# Read the file into Biotite's structure object (atom array)
atom_array = strucio.load_structure(file_name)
# Add an HETATM
atom = struc.Atom(
    coord = [1.0, 2.0, 3.0],
    chain_id = "A",
    # The residue ID is the last ID in the file +1
    res_id = atom_array.res_id[-1] + 1,
    res_name = "ABC",
    hetero = True,
    atom_name = "CA",
    element = "C"
)
atom_array += struc.array([atom])
# Save edited structure
strucio.save_structure("1aki_edited.pdb", atom_array)
The last lines of 1aki_edited.pdb:
...
HETATM 1075 O HOH A 203 12.580 21.214 5.006 1.00 0.000 O
HETATM 1076 O HOH A 204 19.687 23.750 -4.851 1.00 0.000 O
HETATM 1077 O HOH A 205 27.098 35.956 -12.358 1.00 0.000 O
HETATM 1078 O HOH A 206 37.255 9.634 10.002 1.00 0.000 O
HETATM 1079 O HOH A 207 43.755 23.843 8.038 1.00 0.000 O
HETATM 1080 CA ABC A 208 1.000 2.000 3.000 1.00 0.000 C

I have used RDKit to add and edit atoms in PDB files successfully. Below is a small example of how to add a carbon atom to a structure read from a PDB file and write the result to a new .pdb file.
from rdkit import Chem
from rdkit.Geometry import Point3D
# Read in the .pdb file
prot = Chem.MolFromPDBFile("./3etr.pdb")
# Create an editable copy of the molecule
mw = Chem.RWMol(prot)
# Add a carbon atom to the editable mol. AddAtom returns the index of the new atom,
# which equals prot.GetNumAtoms() because atom indices start at 0
c_idx = mw.AddAtom(Chem.Atom(6))
# Get the conformer of the editable mol; it holds the atom coordinates
mw_conf = mw.GetConformer()
# Cartesian coordinates of the new atom. I think the Point3D object is not strictly
# necessary, but it can be easier to handle in RDKit
coord = Point3D(1.0, 2.0, 3.0)
# Set the new coordinates
mw_conf.SetAtomPosition(c_idx, coord)
# Save the edited molecule (not the conformer) as a PDB file
Chem.MolToPDBFile(mw, "_out.pdb")

Related

Perfect scores in multiclass classification?

I am working on a multiclass classification problem with 3 classes (1, 2, 3) that are perfectly balanced (70 instances of each class, resulting in a (210, 8) dataframe).
My data has the 3 classes in order, i.e. the first 70 instances are class 1, the next 70 instances are class 2 and the last 70 instances are class 3. I know that splitting such data without shuffling leads to a good score on the train set but a poor score on the test set, because the test set contains classes that the model has not seen. So I used the stratify parameter in train_test_split. Below is my code:-
import numpy as np
import optuna
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
# pipe, data2 and y are defined earlier (not shown)
# SPLITTING
train_x, test_x, train_y, test_y = train_test_split(
    data2, y, test_size = 0.2, random_state = 69, stratify = y)
# Baseline cross-validation score before tuning
cross_val_model = cross_val_score(pipe, train_x, train_y, cv = 5,
                                  n_jobs = -1, scoring = 'f1_macro')
s_score = cross_val_model.mean()
# Optuna objective: mean cross-validated macro F1 for the sampled parameters
def objective(trial):
    model__n_neighbors = trial.suggest_int('model__n_neighbors', 1, 20)
    model__metric = trial.suggest_categorical(
        'model__metric', ['euclidean', 'manhattan', 'minkowski'])
    model__weights = trial.suggest_categorical('model__weights', ['uniform', 'distance'])
    params = {'model__n_neighbors' : model__n_neighbors,
              'model__metric' : model__metric,
              'model__weights' : model__weights}
    pipe.set_params(**params)
    return np.mean(cross_val_score(pipe, train_x, train_y, cv = 5,
                                   n_jobs = -1, scoring = 'f1_macro'))
knn_study = optuna.create_study(direction = 'maximize')
knn_study.optimize(objective, n_trials = 10)
knn_study.best_params
optuna_gave_score = knn_study.best_value
# Refit with the best parameters and evaluate on the held-out test set
pipe.set_params(**knn_study.best_params)
pipe.fit(train_x, train_y)
pred = pipe.predict(test_x)
c_matrix = confusion_matrix(test_y, pred)
c_report = classification_report(test_y, pred)
Now the problem is that I am getting perfect scores on everything. The f1 macro score from performing cv is 0.898. Below are my confusion matrix and classification report:-
14  0  0
 0 14  0
 0  0 14
Classification Report:-
              precision    recall  f1-score   support
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00        14
           3       1.00      1.00      1.00        14
    accuracy                           1.00        42
   macro avg       1.00      1.00      1.00        42
weighted avg       1.00      1.00      1.00        42
Am I overfitting or what?
Finally got the answer. The dataset I was using was the issue. The dataset was tailor-made for the kNN algorithm, and that was why I was getting perfect scores when using that same algorithm.
I came to this conclusion after I performed a clustering exercise on this dataset and the K-Means algorithm perfectly recovered the three classes as clusters.

How to extract lines that are within radius of cartesian coordinates

I have a data file that has the format of the following:
ATOM 4 N ASP A 1 105.665 49.507 41.867 1.00 71.64 N
ATOM 5 CA ASP A 1 105.992 48.589 42.982 1.00 70.20 C
ATOM 6 C ASP A 1 107.024 49.191 43.936 1.00 69.70 C
In row 1 the numbers (105.665, 49.507, and 41.867) are the columns of the coordinates (x,y,z). How do I extract the entire line with coordinates that are within a specified radius and output them in another file? The equation to correlate the coordinates to the radius is:
radius= SQRT(x^2 + y^2 +z^2)
I think you mean this:
awk -v R=124.44 '($7^2)+($8^2)+($9^2) < R^2' YourFile
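Change the R=124.44 to match your radius. To write the matching lines to another file, append a redirection such as > OtherFile to the command.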
Sample Output
ATOM 4 N ASP A 1 105.665 49.507 41.867 1.00 71.64 N
ATOM 5 CA ASP A 1 105.992 48.589 42.982 1.00 70.20 C
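For comparison, the same filter can be sketched in Python. This is only a rough equivalent of the awk one-liner, under the same assumption that the coordinates are whitespace-separated fields 7 to 9 and that the radius is measured from the origin; the function names within_radius and filter_file are made up for this illustration.

```python
def within_radius(lines, radius):
    """Yield the lines whose (x, y, z) lie strictly inside a sphere around the origin."""
    limit = radius ** 2
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # not enough columns to hold x, y and z
        try:
            # Assumption: fields 7-9 (indices 6-8) are x, y, z, as in the awk answer
            x, y, z = (float(value) for value in fields[6:9])
        except ValueError:
            continue  # skip lines whose coordinate fields are not numbers
        # Compare squared distances to avoid taking a square root
        if x * x + y * y + z * z < limit:
            yield line


def filter_file(src, dst, radius):
    """Copy the lines of src that fall within the radius into dst."""
    with open(src) as infile, open(dst, "w") as outfile:
        outfile.writelines(within_radius(infile, radius))
```

Calling filter_file("YourFile", "OtherFile", 124.44) would then write the matching lines to OtherFile. In a strictly formatted PDB file, neighbouring fields can run together without a space, so slicing the fixed coordinate columns would be more robust than splitting on whitespace.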

How I can add the same text to several files via script?

I want a simple way to add the same text (e.g. "bye", or several lines) to a group of files using a small script. I tried something with ed and vi inside a script, but it didn't work.
Edit: I am editing this question to be more specific:
I have the files c0001.gin c0002.gin ... up to, let's say, 500. I need to add the following text to the end of each file:
species
Ca core 2.00000000
Co core 2.00000000
C core 1.34353898
O core 1.01848700
O shel -2.13300000
buck intra
O core O core 4030.3000 0.245497 0.00000000 0.00 2.50 1 0 0
buck
Ca core O shel 2154.0600 0.289118 0.00000000 0.00 10.00 1 0 0
Co core O shel 1095.6000 0.286300 0.00000000 0.00 10.00 1 0 0
Ca core C core 120000000.000 0.120000 0.00000000 0.00 10.00 1 0 0
Co core C core 95757575.760 0.120000 0.00000000 0.00 10.00 1 0 0
buck inter
O shel O shel 64242.454 0.198913 21.843570 0.00 15.00 1 0 0
morse intra bond
C core O core 5.0000000 2.5228 1.19820 0.0000 1 0
three
C core O core O core 1.7995 120.00
outofplane bond intra
C cor O cor O cor O cor 8.6892 360.0
spring
O 52.740087
I want just a script to do that.
Furthermore, the files are in folder called "CALCS" and I wanted to move each file to another folder inside CALCS called "001" for c0001.gin, "002" for file c0002.gin and so on.
Thanks in advance
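#!/bin/sh
# Usage: script <text> <file>...
text="${1:?Usage: $0 <text> <file>...}"
shift
# "$@" expands to the remaining arguments (the file names), one word per file
for file in "$@"
do
    # Append the text to the end of each file
    echo "$text" >> "$file"
done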
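The second part of the question, moving each file into its own numbered folder, could be handled together with the append step. The following Python sketch is one possible approach, not a tested solution for the real data: the function name append_and_sort is made up, and it assumes the folder name is simply the last three digits of the file name (c0001.gin goes to 001).

```python
import shutil
from pathlib import Path


def append_and_sort(calcs_dir, text):
    """Append text to every c*.gin file in calcs_dir, then move each file
    into a subfolder named after the last three digits of its name."""
    calcs = Path(calcs_dir)
    moved = []
    # sorted() builds the full list first, so moving files does not disturb the loop
    for path in sorted(calcs.glob("c*.gin")):
        with path.open("a") as handle:
            # Make sure the appended block ends with a newline
            handle.write(text if text.endswith("\n") else text + "\n")
        # Assumption: c0001.gin -> folder "001", c0002.gin -> folder "002", ...
        target_dir = calcs / path.stem[-3:]
        target_dir.mkdir(exist_ok=True)
        moved.append(Path(shutil.move(str(path), str(target_dir / path.name))))
    return moved
```

With the text to append stored in a file, a call such as append_and_sort("CALCS", Path("potentials.txt").read_text()) would process the whole folder; the file name potentials.txt is only a placeholder.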

BioPython: Residues size differ from position

I'm currently working with a data set of PDBs and I'm interested in the sizes of the residues (number of atoms per residue). I noticed that the number of atoms, len(residue.child_list), differs between proteins even for the same residue type. For example, the residue 'LEU' has 8 atoms in one protein but 19 in another!
My guess is an error in the PDB or in the PDBParser(), nevertheless the differences are huge!
For example in the case of the molecule 3OQ2:
r = model['B'][88]
r1 = model['B'][15] # residue at chain B position 15
In [287]: r.resname
Out[287]: 'VAL'
In [288]: r1.resname
Out[288]: 'VAL'
But
In [274]: len(r.child_list)
Out[274]: 16
In [276]: len(r1.child_list)
Out[276]: 7
So even within a single molecule there is a difference in the number of atoms. I'd like to know if this is normal biologically, or if there's something wrong. Thank you.
I just looked at the PDB provided, and the difference comes from the fact that for the first VAL (88) there are atomic coordinates for hydrogens (H), whereas for the other VAL (15) there aren't. For VAL 88, the 7 heavy atoms plus 9 hydrogens give the 16 atoms you counted:
ATOM 2962 N VAL B 88 33.193 42.159 23.916 1.00 11.01 N
ANISOU 2962 N VAL B 88 1516 955 1712 56 -227 -128 N
ATOM 2963 CA VAL B 88 33.755 43.168 24.800 1.00 12.28 C
ANISOU 2963 CA VAL B 88 1782 1585 1298 356 -14 286 C
ATOM 2964 C VAL B 88 35.255 43.284 24.530 1.00 12.91 C
ANISOU 2964 C VAL B 88 1661 1672 1573 -249 0 -435 C
ATOM 2965 O VAL B 88 35.961 42.283 24.451 1.00 14.78 O
ANISOU 2965 O VAL B 88 1897 1264 2453 30 -293 21 O
ATOM 2966 CB VAL B 88 33.524 42.841 26.286 1.00 12.81 C
ANISOU 2966 CB VAL B 88 1768 1352 1747 -50 -221 -304 C
ATOM 2967 CG1 VAL B 88 34.166 43.892 27.160 1.00 16.03 C
ANISOU 2967 CG1 VAL B 88 2292 1980 1819 -147 73 -8 C
ATOM 2968 CG2 VAL B 88 32.020 42.727 26.586 1.00 17.67 C
ANISOU 2968 CG2 VAL B 88 2210 2728 1774 -363 -401 83 C
ATOM 2969 H VAL B 88 33.642 41.425 23.899 1.00 13.21 H
ATOM 2970 HA VAL B 88 33.340 44.035 24.608 1.00 14.73 H
ATOM 2971 HB VAL B 88 33.941 41.979 26.492 1.00 15.37 H
ATOM 2972 HG11 VAL B 88 34.011 43.670 28.081 1.00 19.23 H
ATOM 2973 HG12 VAL B 88 35.110 43.912 26.983 1.00 19.23 H
ATOM 2974 HG13 VAL B 88 33.777 44.746 26.959 1.00 19.23 H
ATOM 2975 HG21 VAL B 88 31.902 42.523 27.516 1.00 21.20 H
ATOM 2976 HG22 VAL B 88 31.596 43.562 26.377 1.00 21.20 H
ATOM 2977 HG23 VAL B 88 31.647 42.026 26.047 1.00 21.20 H
I would filter out the hydrogen atoms for every residue in the analysis. Then you should almost always get the same number of atoms for the same residue type. As someone mentioned, the other thing you have to consider is what Biopython calls 'disordered' atoms and residues. These are cases where the file lists more than one alternative location for the atoms in the crystal lattice (this is called 'altloc'). Sorting this out should solve your problem.
Let me know if you need help with filtering out these atoms.
Fábio
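The hydrogen filtering suggested above could look roughly like the following sketch. It assumes that each atom exposes its chemical element as atom.element, as Bio.PDB atoms do; the helper names heavy_atoms and heavy_atom_counts are made up for this example, and alternative locations are not handled specially.

```python
def heavy_atoms(residue):
    """Return the atoms of a residue that are not hydrogen (or deuterium)."""
    # Iterating over a Bio.PDB residue yields its atoms
    return [atom for atom in residue if atom.element not in ("H", "D")]


def heavy_atom_counts(pdb_path, structure_id="example"):
    """Map (chain ID, residue number, residue name) to the heavy-atom count
    for the first model of a PDB file."""
    # Imported here so that heavy_atoms() is usable without Biopython installed
    from Bio.PDB import PDBParser

    structure = PDBParser(QUIET=True).get_structure(structure_id, pdb_path)
    counts = {}
    for residue in structure[0].get_residues():
        chain_id = residue.get_parent().id
        # residue.id is a (hetero flag, sequence number, insertion code) tuple
        key = (chain_id, residue.id[1], residue.get_resname())
        counts[key] = len(heavy_atoms(residue))
    return counts
```

For the two valines discussed above, len(heavy_atoms(r)) and len(heavy_atoms(r1)) should then both come out as 7, provided no heavy atoms are missing from either residue.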

Check if string exist in non-consecutive lines in a given column

I have files with the following format:
ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C
ATOM 8963 C VAL W 8 10.053 80.963 25.506 1.00114.60 C
ATOM 8964 O VAL W 8 10.636 80.422 26.442 1.00114.53 O
ATOM 8965 CB VAL W 8 7.643 80.389 25.325 1.00115.67 C
ATOM 8966 CG1 VAL W 8 6.476 80.508 26.249 1.00115.54 C
ATOM 8967 CG2 VAL W 8 7.174 80.526 23.886 1.00115.26 C
ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O
ATOM 4441 CB TYR S 89 2.847 168.812 -13.864 1.00 96.31 C
ATOM 4442 CG TYR S 89 3.887 169.413 -14.756 1.00 98.43 C
ATOM 4443 CD1 TYR S 89 3.515 170.073 -15.932 1.00100.05 C
ATOM 4444 CD2 TYR S 89 5.251 169.308 -14.451 1.00100.50 C
ATOM 4445 CE1 TYR S 89 4.464 170.642 -16.779 1.00100.70 C
ATOM 4446 CE2 TYR S 89 6.219 169.868 -15.298 1.00101.40 C
ATOM 4447 CZ TYR S 89 5.811 170.535 -16.464 1.00100.46 C
ATOM 4448 OH TYR S 89 6.736 171.094 -17.321 1.00100.20 O
ATOM 4449 N LEU S 90 3.944 166.393 -12.414 1.00 94.95 N
ATOM 4450 CA LEU S 90 5.079 165.622 -11.914 1.00 94.44 C
ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N
ATOM 5152 CA LEU W 8 -64.800 210.035 -10.384 1.00116.52 C
ATOM 5153 C LEU W 8 -64.177 208.641 -10.198 1.00116.71 C
ATOM 5154 O LEU W 8 -64.513 207.944 -9.241 1.00116.99 O
ATOM 5155 CB LEU W 8 -65.086 210.682 -9.033 1.00115.76 C
ATOM 5156 CG LEU W 8 -64.274 211.829 -8.478 1.00113.89 C
ATOM 5157 CD1 LEU W 8 -64.528 211.857 -7.006 1.00111.94 C
ATOM 5158 CD2 LEU W 8 -62.828 211.612 -8.739 1.00112.96 C
In principle, column 5 (W, in this case, which represents the chain ID) should be identical only in consecutive chunks. However, in files with too many chains, there are not enough letters in the alphabet to assign a unique ID to each chain, and therefore duplicates may occur.
I would like to be able to check whether or not this is the case. In other words I would like to know if a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind if it changes from W to S, I would like to know if there are two chunks sharing the same chain ID. In this case, if W or S reappear at some point. In fact, this is only a problem if they also share the first and the 6th columns, but I do not want to complicate things too much.
I do not want to print the lines, just to know the name of the file in which the issue occurs and the chain ID (in this case W), in order to solve the problem. In fact, I already know how to solve the problem, but I need to identify the problematic files so that I can focus on those and avoid repairing files that are already fine.
SOLUTION (thanks to all for your help and namely to sehe):
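for pdb in *.pdb ; do
    # Keep ATOM records, cut out the chain ID column, collapse each run of
    # identical IDs to one line, then report IDs that occur in more than one run
    hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
    [ "$hit" ] && echo "$pdb" = $hit
done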
For this particular sample:
cut -c22-23 t | uniq | sort | uniq -dc
Will output
2 W
(the 22nd column contains 2 runs of the letter 'W')
untested
awk '
    seen[$5] && $5 != current {
        print "found non-consecutive chain on line " NR
        exit
    }
    { current = $5; seen[$5] = 1 }
' filename
Here you go, this awk script is tested and takes into account not just 'W':
{
    if (ln[$5] && ln[$5] + 1 != NR) {
        print "dup " $5 " at line " NR;
    }
    ln[$5] = NR;
}
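The same check can also be sketched in Python. This version is only an illustration: it assumes properly formatted PDB files in which the chain ID sits in column 22 of each ATOM record (as the cut -c22-23 solution does), and the function name repeated_chain_ids is made up.

```python
from itertools import groupby


def repeated_chain_ids(lines):
    """Return the chain IDs that occur in more than one consecutive run."""
    # Assumption: fixed-column PDB format, chain ID in column 22 (index 21)
    chain_ids = (
        line[21] for line in lines
        if line.startswith("ATOM") and len(line) > 21
    )
    # groupby collapses each run of identical consecutive IDs into one entry
    runs = [chain for chain, _ in groupby(chain_ids)]
    seen = set()
    repeated = []
    for chain in runs:
        if chain in seen and chain not in repeated:
            repeated.append(chain)
        seen.add(chain)
    return repeated
```

To scan a directory, one could loop over Path(".").glob("*.pdb"), open each file, pass the file object to repeated_chain_ids and print the file name together with the returned IDs whenever the list is not empty.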
