BioPython: Residues size differ from position - bioinformatics

I'm currently working with a data set of PDB files and I'm interested in the sizes of the residues (number of atoms per residue). I noticed that the number of atoms, len(residue.child_list), differs between proteins even for the same residue type. For example, a 'LEU' residue has 8 atoms in one protein but 19 in another!
My guess is that there is an error in the PDB file or in PDBParser(); either way, the differences are huge.
For example in the case of the molecule 3OQ2:
r = model['B'][88]
r1 = model['B'][15] # residue at chain B position 15
In [287]: r.resname
Out[287]: 'VAL'
In [288]: r1.resname
Out[288]: 'VAL'
But
In [274]: len(r.child_list)
Out[274]: 16
In [276]: len(r1.child_list)
Out[276]: 7
So even within a single molecule there is a difference in the number of atoms. I'd like to know whether this is normal biologically or whether something is wrong. Thank you.

I just looked at the PDB file provided. The difference is that the first VAL (88) has atomic coordinates for hydrogens (H), whereas the other VAL (15) does not.
ATOM 2962 N VAL B 88 33.193 42.159 23.916 1.00 11.01 N
ANISOU 2962 N VAL B 88 1516 955 1712 56 -227 -128 N
ATOM 2963 CA VAL B 88 33.755 43.168 24.800 1.00 12.28 C
ANISOU 2963 CA VAL B 88 1782 1585 1298 356 -14 286 C
ATOM 2964 C VAL B 88 35.255 43.284 24.530 1.00 12.91 C
ANISOU 2964 C VAL B 88 1661 1672 1573 -249 0 -435 C
ATOM 2965 O VAL B 88 35.961 42.283 24.451 1.00 14.78 O
ANISOU 2965 O VAL B 88 1897 1264 2453 30 -293 21 O
ATOM 2966 CB VAL B 88 33.524 42.841 26.286 1.00 12.81 C
ANISOU 2966 CB VAL B 88 1768 1352 1747 -50 -221 -304 C
ATOM 2967 CG1 VAL B 88 34.166 43.892 27.160 1.00 16.03 C
ANISOU 2967 CG1 VAL B 88 2292 1980 1819 -147 73 -8 C
ATOM 2968 CG2 VAL B 88 32.020 42.727 26.586 1.00 17.67 C
ANISOU 2968 CG2 VAL B 88 2210 2728 1774 -363 -401 83 C
ATOM 2969 H VAL B 88 33.642 41.425 23.899 1.00 13.21 H
ATOM 2970 HA VAL B 88 33.340 44.035 24.608 1.00 14.73 H
ATOM 2971 HB VAL B 88 33.941 41.979 26.492 1.00 15.37 H
ATOM 2972 HG11 VAL B 88 34.011 43.670 28.081 1.00 19.23 H
ATOM 2973 HG12 VAL B 88 35.110 43.912 26.983 1.00 19.23 H
ATOM 2974 HG13 VAL B 88 33.777 44.746 26.959 1.00 19.23 H
ATOM 2975 HG21 VAL B 88 31.902 42.523 27.516 1.00 21.20 H
ATOM 2976 HG22 VAL B 88 31.596 43.562 26.377 1.00 21.20 H
ATOM 2977 HG23 VAL B 88 31.647 42.026 26.047 1.00 21.20 H
I would filter out these hydrogen atoms for every residue in the analysis; you should then almost always get the same number of atoms for a given residue type. As someone mentioned, the other thing you have to consider is what Biopython calls 'disordered residues'. These are residues for which there is more than one alternative location for the atoms in the crystal lattice (this is called 'altloc'). Sorting out both should solve your problem.
Let me know if you need help with filtering out these atoms.
Fábio
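To make the filtering idea concrete, here is a minimal sketch in plain Python (no Biopython needed) that counts only non-hydrogen atoms per residue. It assumes standard fixed-column PDB records with the element symbol in columns 77-78, and, as a simplification, keeps only atoms with a blank altloc or altloc 'A'. The names count_heavy_atoms and atom_line are made up for this illustration; atom_line only builds column-correct sample records with dummy coordinates.

```python
from collections import Counter


def count_heavy_atoms(pdb_lines):
    """Count non-hydrogen atoms per (chain, residue number, residue name)."""
    counts = Counter()
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip ANISOU, TER, REMARK and other records
        if line[16] not in (" ", "A"):
            continue  # assumption: keep only the first alternate location
        if line[76:78].strip().upper() in ("H", "D"):
            continue  # drop hydrogens (and deuterium)
        counts[(line[21], int(line[22:26]), line[17:20].strip())] += 1
    return counts


def atom_line(serial, name, resname, chain, resseq, element):
    """Hypothetical helper: build a column-correct ATOM record (dummy coordinates)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}{'':4}"
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )


# Atom names of the two valines discussed above: VAL 88 has hydrogens, VAL 15 does not.
heavy = ["N", "CA", "C", "O", "CB", "CG1", "CG2"]
hydrogens = ["H", "HA", "HB", "HG11", "HG12", "HG13", "HG21", "HG22", "HG23"]
lines = [atom_line(i, n, "VAL", "B", 88, n[0]) for i, n in enumerate(heavy + hydrogens, 1)]
lines += [atom_line(100 + i, n, "VAL", "B", 15, n[0]) for i, n in enumerate(heavy, 1)]

counts = count_heavy_atoms(lines)
print(counts[("B", 88, "VAL")], counts[("B", 15, "VAL")])  # 7 7
```

With Biopython itself the same test can be done per atom on the atom's element attribute, and atom.is_disordered() tells you whether alternate locations are present.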

Related

A possible bug in PDB module of Biopython (getting wrong coordinates for one of the residues)

I believe that I have found a possible bug in the PDB module of Biopython. In short, I have been looking at ligands in the 2r09 structure from the PDB. This structure contains two copies of identical "4IP" hetero residues, both located close to the protein molecules. You can see this very clearly here: https://www.rcsb.org/3d-view/2R09.
However, during parsing something strange happens, and the coordinates of one of the 4IP residues change dramatically, so that it is no longer where it should be. In fact, it ends up noticeably shifted from its original position. I have manually compared the coordinates in the PDB file with the ones I got using Biopython, and indeed they do not match. Moreover, when I save the opened structure from Biopython without any additional manipulation, I get wrong results, which I confirmed by visualizing the two PDB files from before and after opening and saving the structure with Biopython.
By the way, the same can be seen with the nglview library, for example, which lets you visualize structures inside a Jupyter notebook. Once again, if the structure is loaded separately (not with Biopython), it looks perfectly fine, which cannot be said of the structure loaded with Biopython.
Here are the original coordinates of the first 4 atoms of the 4IP residue in the original PDB file:
HETATM 5636 C1 4IP A 405 80.967 85.113 26.680 1.00 22.42 C
HETATM 5637 O1 4IP A 405 82.327 85.039 27.129 1.00 23.40 O
HETATM 5638 C2 4IP A 405 80.917 85.791 25.309 1.00 22.60 C
HETATM 5639 O2 4IP A 405 81.463 87.121 25.385 1.00 20.92 O
Here is what I get after saving the structure with biopython:
HETATM 1 C1 4IP A 405 30.570 61.217 -13.415 1.00 22.42 C
HETATM 2 O1 4IP A 405 29.672 60.422 -14.201 1.00 23.40 O
HETATM 3 C2 4IP A 405 30.182 61.120 -11.938 1.00 22.60 C
HETATM 4 O2 4IP A 405 28.836 61.592 -11.740 1.00 20.92 O
It is highly possible that I just did something wrong here, but I can't find what exactly.
Copying the approach from "How to save each ligand from a PDB file separately with Bio.PDB?":
from Bio import PDB
from Bio.PDB import PDBIO, Select
class ResidueSelect(Select):
    def __init__(self, chain, residue):
        self.chain = chain
        self.residue = residue
    def accept_chain(self, chain):
        return chain.id == self.chain.id
    def accept_residue(self, residue):
        """Accept only the selected ligand residue."""
        return residue == self.residue
def extract_ligands(file, ligand):
    parser = PDB.PDBParser(PERMISSIVE=1, QUIET=1)
    structure: PDB.Structure.Structure = parser.get_structure(file.split('.')[-1], file)
    print('ooooooo : ', file.split('.')[0])
    io = PDBIO()
    io.set_structure(structure)
    i = 1
    for model in structure:
        for chain in model:
            for residue in chain:
                if residue.get_resname() == ligand:
                    print(f"saving {chain} {residue}")
                    # One output file per matching ligand residue
                    io.save(f"lig_{file.split('.')[0]}_{i}.pdb", ResidueSelect(chain, residue))
                    i += 1
extract_ligands("2r09.pdb", '4IP')
outputs:
lig_2r09_1 :
HETATM 1 C1 4IP A 405 80.967 85.113 26.680 1.00 22.42 C
HETATM 2 O1 4IP A 405 82.327 85.039 27.129 1.00 23.40 O
HETATM 3 C2 4IP A 405 80.917 85.791 25.309 1.00 22.60 C
HETATM 4 O2 4IP A 405 81.463 87.121 25.385 1.00 20.92 O
HETATM 5 C3 4IP A 405 79.465 85.823 24.827 1.00 20.89 C
........................
lig_2r09_2 :
HETATM 1 C1 4IP B 400 19.832 75.384 -3.106 1.00 26.91 C
HETATM 2 O1 4IP B 400 18.852 74.503 -3.643 1.00 28.03 O
HETATM 3 C2 4IP B 400 19.809 75.360 -1.573 1.00 25.72 C
HETATM 4 O2 4IP B 400 18.531 75.764 -1.075 1.00 26.15 O
...................
Again, using Biopython > 1.78 seems to give the right coordinates.
OK, working in Biopython 1.79, I tried to load your PDB file, 2r09,
borrowing active_site.py from https://github.com/agpe/biopython-ligands
and modifying it into:
import Bio.PDB
import numpy
from Bio.PDB import PDBIO
cnt = 0
def residue_dist_to_ligand(protein_residue, ligand_residue):
    # Returns the distance from the protein C-alpha to the closest ligand atom,
    # or None if the residue has no C-alpha (e.g. water or another ligand)
    if "CA" not in protein_residue:
        return None
    dist = []
    for atom in ligand_residue:
        vector = protein_residue["CA"].coord - atom.coord
        dist.append(numpy.sqrt(numpy.sum(vector * vector)))
    return min(dist)
def get_ligand_by_name(residue_name, model):
    # Extract ligands from all chains in a model by name
    global ligands
    ligands = {}
    chains = model.child_dict
    for c in chains:
        ligands[c] = []
        for protein_res in chains[c].child_list:
            if protein_res.resname == residue_name:
                ligands[c].append(protein_res)
    print('ligands:', ligands)
    return ligands
def get_ligand_by_chain(chain_name, model):
    # Extract all ligand residues from the given chain name
    global ligands
    ligands = {}
    ligands[chain_name] = []
    chains = model.child_dict
    for protein_res in chains[chain_name].child_list:
        ligands[chain_name].append(protein_res)
    return ligands
def active_site(ligands, distance, model):
    # Prints out residues located within a given distance of the ligand
    chains = model.child_dict
    for group in ligands.values():
        for ligand_res in group:
            print("ligand residue: "+ligand_res.resname, ligand_res.id[1])
            for c in chains:
                for protein_res in chains[c].child_list:
                    if protein_res not in group:
                        dist = residue_dist_to_ligand(protein_res, ligand_res)
                        if dist and dist < distance:
                            print(protein_res.resname, protein_res.id[1], dist)
def save_ligand(structure, filename):
    # Saves ligand to filename.pdb
    print('ligands.items() : ', ligands.items())
    # Hetero flags (e.g. 'H_4IP') and chain IDs of the stored ligands
    otto = [x[0].id[0] for x in ligands.values()]
    print([x[0].id[0] for x in ligands.values()])
    nove = [x[0] for x in ligands.keys()]
    print([x[0] for x in ligands.keys()])
    Select = Bio.PDB.Select
    class LigandSelect(Select):
        def accept_residue(self, residue):
            if str(residue.id[0]) in otto and str(residue.get_parent().id) in nove:
                global cnt
                cnt += 1
                print('ligand #'+str(cnt)+' : ', str(residue.id[0]), ' from chain : ', str(residue.get_parent().id))
                return 1
            else:
                return 0
    io = PDBIO()
    io.set_structure(structure)
    io.save(filename+'.pdb', LigandSelect(), preserve_atom_numbering=False)
After more googling, I realized that save_ligand(structure, filename) could be written more concisely:
def save_ligand(structure, filename):
    # Saves ligand to filename.pdb
    Select = Bio.PDB.Select
    class LigandSelect(Select):
        def accept_residue(self, residue):
            return residue in [item for sublist in ligands.values() for item in sublist]
    io = PDBIO()
    io.set_structure(structure)
    io.save(filename+'.pdb', LigandSelect(), preserve_atom_numbering=False)
My code, main.py:
from Bio import PDB
from Bio.PDB import PDBIO
import active_site as SV
parser = PDB.PDBParser(PERMISSIVE=1, QUIET=1)
structure: PDB.Structure.Structure = parser.get_structure("prova", "2r09.pdb")
ligand = SV.get_ligand_by_name('4IP', structure[0])
print('ligand -> ', ligand, type(ligand))
SV.save_ligand(structure[0], 'ligand')
I can load your PDB file and save both ligands present in the structure
(the same ligand in two different chains, A and B; I don't know whether the protein is a monomer or a dimer, i.e. a monomer that appears twice in the asymmetric unit or an actual dimer).
My result, ligand.pdb, is:
HETATM 1 C1 4IP A 405 80.967 85.113 26.680 1.00 22.42 C
HETATM 2 O1 4IP A 405 82.327 85.039 27.129 1.00 23.40 O
HETATM 3 C2 4IP A 405 80.917 85.791 25.309 1.00 22.60 C
HETATM 4 O2 4IP A 405 81.463 87.121 25.385 1.00 20.92 O
HETATM 5 C3 4IP A 405 79.465 85.823 24.827 1.00 20.89 C
HETATM 6 O3 4IP A 405 79.392 86.282 23.475 1.00 20.17 O
HETATM 7 C4 4IP A 405 78.585 86.660 25.757 1.00 20.60 C
HETATM 8 O4 4IP A 405 77.224 86.534 25.358 1.00 20.32 O
HETATM 9 C5 4IP A 405 78.675 86.137 27.199 1.00 21.70 C
HETATM 10 O5 4IP A 405 77.999 87.108 28.007 1.00 20.84 O
HETATM 11 C6 4IP A 405 80.116 85.898 27.686 1.00 21.33 C
HETATM 12 O6 4IP A 405 80.130 85.209 28.962 1.00 21.36 O
HETATM 13 P1 4IP A 405 83.280 83.762 26.840 1.00 25.39 P
HETATM 14 O1P 4IP A 405 84.620 84.387 27.067 1.00 24.93 O
HETATM 15 O2P 4IP A 405 83.011 83.316 25.424 1.00 24.90 O
HETATM 16 O3P 4IP A 405 82.883 82.738 27.868 1.00 26.91 O
HETATM 17 P3 4IP A 405 78.855 85.297 22.324 1.00 20.21 P
HETATM 18 O4P 4IP A 405 77.428 84.990 22.747 1.00 20.15 O
HETATM 19 O5P 4IP A 405 79.754 84.074 22.330 1.00 21.26 O
HETATM 20 O6P 4IP A 405 78.942 86.136 21.074 1.00 19.72 O
HETATM 21 P4 4IP A 405 76.408 87.690 24.600 1.00 19.61 P
HETATM 22 O7P 4IP A 405 77.277 88.918 24.665 1.00 19.52 O
HETATM 23 O8P 4IP A 405 75.127 87.775 25.380 1.00 19.15 O
HETATM 24 O9P 4IP A 405 76.215 87.157 23.199 1.00 19.73 O
HETATM 25 P5 4IP A 405 77.266 86.724 29.383 1.00 21.18 P
HETATM 26 OPF 4IP A 405 76.447 85.480 29.103 1.00 19.82 O
HETATM 27 OPG 4IP A 405 76.396 87.922 29.662 1.00 21.85 O
HETATM 28 OPH 4IP A 405 78.385 86.527 30.363 1.00 20.70 O
TER 29 4IP A 405
HETATM 29 C1 4IP B 400 19.832 75.384 -3.106 1.00 26.91 C
HETATM 30 O1 4IP B 400 18.852 74.503 -3.643 1.00 28.03 O
HETATM 31 C2 4IP B 400 19.809 75.360 -1.573 1.00 25.72 C
HETATM 32 O2 4IP B 400 18.531 75.764 -1.075 1.00 26.15 O
HETATM 33 C3 4IP B 400 20.886 76.313 -1.046 1.00 25.51 C
HETATM 34 O3 4IP B 400 20.949 76.259 0.388 1.00 24.29 O
HETATM 35 C4 4IP B 400 20.660 77.747 -1.535 1.00 24.17 C
HETATM 36 O4 4IP B 400 21.786 78.546 -1.164 1.00 23.22 O
HETATM 37 C5 4IP B 400 20.563 77.819 -3.060 1.00 26.06 C
HETATM 38 O5 4IP B 400 20.179 79.156 -3.403 1.00 24.79 O
HETATM 39 C6 4IP B 400 19.613 76.789 -3.678 1.00 26.21 C
HETATM 40 O6 4IP B 400 19.845 76.713 -5.095 1.00 26.93 O
HETATM 41 P1 4IP B 400 19.148 72.969 -4.040 1.00 30.18 P
HETATM 42 O1P 4IP B 400 17.769 72.379 -3.959 1.00 29.62 O
HETATM 43 O2P 4IP B 400 19.703 72.997 -5.442 1.00 32.15 O
HETATM 44 O3P 4IP B 400 20.131 72.484 -3.004 1.00 30.58 O
HETATM 45 P3 4IP B 400 22.293 75.724 1.102 1.00 25.03 P
HETATM 46 O4P 4IP B 400 23.378 76.683 0.677 1.00 23.91 O
HETATM 47 O5P 4IP B 400 22.477 74.324 0.571 1.00 24.81 O
HETATM 48 O6P 4IP B 400 21.967 75.846 2.579 1.00 23.92 O
HETATM 49 P4 4IP B 400 21.769 79.638 0.027 1.00 24.76 P
HETATM 50 O7P 4IP B 400 20.352 79.738 0.497 1.00 22.44 O
HETATM 51 O8P 4IP B 400 22.374 80.892 -0.565 1.00 23.32 O
HETATM 52 O9P 4IP B 400 22.674 79.045 1.083 1.00 24.83 O
HETATM 53 P5 4IP B 400 20.543 79.836 -4.819 1.00 27.59 P
HETATM 54 OPF 4IP B 400 22.002 79.494 -5.029 1.00 26.64 O
HETATM 55 OPG 4IP B 400 20.334 81.296 -4.538 1.00 26.24 O
HETATM 56 OPH 4IP B 400 19.586 79.196 -5.799 1.00 25.90 O
TER 57 4IP B 400
END
The coordinates match those in the original PDB file:
HETATM 5636 C1 4IP A 405 80.967 85.113 26.680 1.00 22.42 C
HETATM 5637 O1 4IP A 405 82.327 85.039 27.129 1.00 23.40 O
HETATM 5638 C2 4IP A 405 80.917 85.791 25.309 1.00 22.60 C
.....
HETATM 5706 C1 4IP B 400 19.832 75.384 -3.106 1.00 26.91 C
HETATM 5707 O1 4IP B 400 18.852 74.503 -3.643 1.00 28.03 O
HETATM 5708 C2 4IP B 400 19.809 75.360 -1.573 1.00 25.72 C
HETATM 5709 O2 4IP B 400 18.531 75.764 -1.075 1.00 26.15 O
Check it out to see whether I made any mistake. Can't you upgrade from 1.78 to 1.79?
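If you want to check this without eyeballing the files, a small script can compare the coordinates before and after the round trip. The following is only a sketch under some assumptions: it reads standard fixed-column HETATM records, matches atoms by (chain, residue number, atom name), and the function names hetatm_coords, moved_atoms and hetatm_line are invented for this example. The sample values are the first two 4IP atoms quoted in the question.

```python
def hetatm_coords(pdb_lines, resname):
    """Map (chain, residue number, atom name) -> (x, y, z) for one hetero residue name."""
    coords = {}
    for line in pdb_lines:
        if line.startswith("HETATM") and line[17:20].strip() == resname:
            key = (line[21], int(line[22:26]), line[12:16].strip())
            coords[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords


def moved_atoms(original, saved, tol=1e-3):
    """Return the atoms present in both maps whose coordinates differ by more than tol."""
    return sorted(
        key
        for key in original.keys() & saved.keys()
        if any(abs(a - b) > tol for a, b in zip(original[key], saved[key]))
    )


def hetatm_line(serial, name, resname, chain, resseq, xyz):
    """Hypothetical helper: build a column-correct HETATM record for the demo."""
    x, y, z = xyz
    return (
        f"{'HETATM':<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}{'':4}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )


original = [
    hetatm_line(5636, "C1", "4IP", "A", 405, (80.967, 85.113, 26.680)),
    hetatm_line(5637, "O1", "4IP", "A", 405, (82.327, 85.039, 27.129)),
]
# Coordinates the asker reported after saving with the older Biopython
shifted = [
    hetatm_line(1, "C1", "4IP", "A", 405, (30.570, 61.217, -13.415)),
    hetatm_line(2, "O1", "4IP", "A", 405, (29.672, 60.422, -14.201)),
]
# Same coordinates, only renumbered, as in the ligand.pdb output above
renumbered = [
    hetatm_line(1, "C1", "4IP", "A", 405, (80.967, 85.113, 26.680)),
    hetatm_line(2, "O1", "4IP", "A", 405, (82.327, 85.039, 27.129)),
]

bad = moved_atoms(hetatm_coords(original, "4IP"), hetatm_coords(shifted, "4IP"))
ok = moved_atoms(hetatm_coords(original, "4IP"), hetatm_coords(renumbered, "4IP"))
print(bad)  # [('A', 405, 'C1'), ('A', 405, 'O1')]
print(ok)   # []
```

On real files you would pass open("2r09.pdb") and open("ligand.pdb") instead of the sample lists; an empty result means the round trip kept every matched atom in place.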

Add heteroatom to pdb file

I am using Biopython to perform various operations on a PDB file. Subsequently, I would like to add some new atoms to the structure object generated by Biopython. Is there a good/recommended way to do this in Python? It seems Biopython only provides options to write out existing elements of a PDB file, not to create new ones.
You could have a look at the Python package Biotite (https://www.biotite-python.org/), a package I am developing.
In the following example code, a PDB structure is downloaded, read and then an atom is added:
import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.io as strucio
# Download lysozyme structure for example
file_name = rcsb.fetch("1aki", "pdb", target_path=".")
# Read the file into Biotite's structure object (atom array)
atom_array = strucio.load_structure(file_name)
# Add an HETATM
atom = struc.Atom(
    coord = [1.0, 2.0, 3.0],
    chain_id = "A",
    # The residue ID is the last ID in the file +1
    res_id = atom_array.res_id[-1] + 1,
    res_name = "ABC",
    hetero = True,
    atom_name = "CA",
    element = "C"
)
atom_array += struc.array([atom])
# Save edited structure
strucio.save_structure("1aki_edited.pdb", atom_array)
The last lines of 1aki_edited.pdb:
...
HETATM 1075 O HOH A 203 12.580 21.214 5.006 1.00 0.000 O
HETATM 1076 O HOH A 204 19.687 23.750 -4.851 1.00 0.000 O
HETATM 1077 O HOH A 205 27.098 35.956 -12.358 1.00 0.000 O
HETATM 1078 O HOH A 206 37.255 9.634 10.002 1.00 0.000 O
HETATM 1079 O HOH A 207 43.755 23.843 8.038 1.00 0.000 O
HETATM 1080 CA ABC A 208 1.000 2.000 3.000 1.00 0.000 C
I have used RDKit to add and edit atoms in PDB files successfully. Below is a small example of how to add a carbon atom to a structure read from a PDB file and write a new .pdb file:
from rdkit import Chem
from rdkit.Geometry import Point3D
prot = Chem.MolFromPDBFile("./3etr.pdb")  # Read in the .pdb file
# Create an editable mol object (a copy of prot, including its conformer)
mw = Chem.RWMol(prot)
# Get the conformer of the editable mol. This holds the atoms' coordinates
mw_conf = mw.GetConformer()
# Add a carbon atom to the editable mol. Returns the index of the new atom,
# which equals prot.GetNumAtoms() because atom indices start at 0
c_idx = mw.AddAtom(Chem.Atom(6))
# Cartesian coordinates of the new atom. I think the Point3D object is not strictly necessary, but it can be easier to handle in RDKit
coord = Point3D(1.0, 2.0, 3.0)
# Set the new coordinates
mw_conf.SetAtomPosition(c_idx, coord)
# Save the edited molecule (not the conformer) as a PDB file
Chem.MolToPDBFile(mw, "_out.pdb")

Sum data in one column in a specific order in Spotfire

Does anyone know how to create a calculated column (in Spotfire) that will sum data in order of increasing values contained within another column?
For example, what would the expression be to Sum data in [P] in increasing order of [K], for each [Well]
Some example data:
Well  Depth  P      K
A     85     0.191  108
A     85.5   0.192  102
A     87     0.17   49
A     88     0.184  47
A     89     0.192  50
B     298    0.215  177
B     298.5  0.2    177
B     300    0.017  105
B     301    0.23   200
You can use:
Sum([P]) OVER (intersect([Well],AllPrevious([K])))
This returns the cumulative sum of P in order of K per Well in ascending order of K.
Well  K    P      Cumulative Sum of P
A     47   0,184  0,184
A     49   0,17   0,354
A     50   0,192  0,546
A     102  0,192  0,738
A     108  0,191  0,929
B     105  0,017  0,017
B     177  0,215  0,432
B     177  0,2    0,432
B     200  0,23   0,662
Edit, based on OP's comment:
To get the cumulative sum in descending order of K, you can use:
Sum([P]) OVER (intersect([Well],AllNext([K])))
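For anyone who wants to sanity-check the ascending result outside Spotfire, here is a rough Python equivalent of the first expression. It is a sketch, not Spotfire's implementation: it assumes that each row's value is the sum of P over all rows in the same Well whose K is less than or equal to that row's K, which is what the result table above shows (note the two K = 177 rows sharing 0,432). The function name cumulative_sum_by_k is made up.

```python
from collections import defaultdict


def cumulative_sum_by_k(rows):
    """rows: (well, k, p) tuples. Returns (well, k, p, cumulative) sorted by well, then k."""
    by_well = defaultdict(list)
    for well, k, p in rows:
        by_well[well].append((k, p))
    result = []
    for well in sorted(by_well):
        for k, p in sorted(by_well[well]):
            # Sum P over every row of this well with K <= the current K (ties included)
            total = sum(q for j, q in by_well[well] if j <= k)
            result.append((well, k, p, round(total, 3)))
    return result


rows = [
    ("A", 108, 0.191), ("A", 102, 0.192), ("A", 49, 0.17), ("A", 47, 0.184), ("A", 50, 0.192),
    ("B", 177, 0.215), ("B", 177, 0.2), ("B", 105, 0.017), ("B", 200, 0.23),
]
table = cumulative_sum_by_k(rows)
for row in table:
    print(*row)
```

Flipping the comparison to j >= k gives the descending variant from the edit.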

How to calculate classification error rate

Alright. Now this question is pretty hard. I am going to give you an example.
The left numbers are my algorithm's classification and the right numbers are the original class numbers:
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89
So here my algorithm merged 2 different classes into 1: as you can see, it merged classes 86 and 89 into one class. What would the error be in the example above?
Here is another example:
203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7
In the example above, the left numbers are my algorithm's classification and the right numbers are the original class IDs. As can be seen, it misclassified 3 products (I am classifying identical commercial products). So what would the error rate be in this example? How would you calculate it?
This question is pretty hard and complex. We have finished the classification, but we are not able to find the correct algorithm for calculating the success rate :D
Here's a longish example, a real confusion matrix with 10 input classes "0" - "9"
(handwritten digits),
and 10 output clusters labelled A - J.
Confusion matrix for 5620 optdigits:
True 0 - 9 down, clusters A - J across
-----------------------------------------------------
A B C D E F G H I J
-----------------------------------------------------
0: 2 4 1 546 1
1: 71 249 11 1 6 228 5
2: 13 5 64 1 13 1 460
3: 29 2 507 20 5 9
4: 33 483 4 38 5 3 2
5: 1 1 2 58 3 480 13
6: 2 1 2 294 1 1 257
7: 1 5 1 546 6 7
8: 415 15 2 5 3 12 13 87 2
9: 46 72 2 357 35 1 47 2
----------------------------------------------------
580 383 496 1002 307 670 549 557 810 266 estimates in each cluster
y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580 383 496 1002 307 670 549 557 810 266]
For example, cluster A has 580 data points, 415 of which are "8"s;
cluster B has 383 data points, 249 of which are "1"s; and so on.
The problem is that the output classes are scrambled, permuted;
they correspond in this order, with counts:
A B C D E F G H I J
8 1 4 3 6 7 0 5 2 6
415 249 483 507 294 546 546 480 460 257
One could say that the "success rate" is
75 % = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620
but this throws away useful information —
here, that E and J both say "6", and no cluster says "9".
So, add up the biggest numbers in each column of the confusion matrix
and divide by the total.
But, how to count overlapping / missing clusters,
like the 2 "6"s, no "9"s here ?
I don't know of a commonly agreed-upon way
(doubt that the Hungarian algorithm
is used in practice).
Bottom line: don't throw away information; look at the whole confusion matrix.
NB such a "success rate" will be optimistic for new data !
It's customary to split the data into say 2/3 "training set" and 1/3 "test set",
train e.g. k-means on the 2/3 alone,
then measure confusion / success rate on the test set — generally worse than on the training set alone.
Much more can be said; see e.g.
Cross-validation.
You have to define the error criteria if you want to evaluate the performance of an algorithm, so I'm not sure exactly what you're asking. In some clustering and machine learning algorithms you define the error metric and it minimizes it.
Take a look at this
https://en.wikipedia.org/wiki/Confusion_matrix
to get some ideas
You have to define an error metric yourself. In your case, a simple method would be to find the property mapping of your product as
p = properties(id)
where id is the product ID, and p is likely to be a vector whose entries are the different properties. Then you can define the error function e (or distance) between two products as
e = d(p1, p2)
Of course, each property must be evaluated to a number in this function. This error function can then be used in the classification algorithm and in learning.
In your second example, it seems that you treat the pair (203 7) as a successful classification, so I think you already have a metric yourself. You may need to be more specific to get a better answer.
Classification Error Rate (CER) is 1 - Purity (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
Code of @john-colby
Or
CER <- function(clusters, classes) {
  1 - sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
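The same purity-based calculation in Python, applied to the two examples from the question, might look like the sketch below. The function names purity and classification_error_rate are chosen for this illustration; the logic follows the definition above (for each cluster take the count of its most common true class, sum those, divide by the number of items).

```python
from collections import Counter, defaultdict


def purity(clusters, classes):
    """Fraction of items that belong to the majority true class of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, cls in zip(clusters, classes):
        by_cluster[cluster][cls] += 1
    majority_total = sum(max(counter.values()) for counter in by_cluster.values())
    return majority_total / len(clusters)


def classification_error_rate(clusters, classes):
    return 1 - purity(clusters, classes)


# First example: one cluster (177) containing 9 items of class 86 and 7 of class 89
merged = classification_error_rate([177] * 16, [86] * 9 + [89] * 7)
# Second example: class 7 spread over clusters 203, 16 and 17
split = classification_error_rate([203, 203, 203, 203, 16, 203, 17, 16, 203], [7] * 9)
print(merged, split)  # 0.4375 0.0
```

Note what the second number says: purity does not penalize splitting one class over several clusters, so the 3 "misclassified" products cost nothing under this metric. That is one more reason to look at the whole confusion matrix rather than a single number.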

Check if string exist in non-consecutive lines in a given column

I have files with the following format:
ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C
ATOM 8963 C VAL W 8 10.053 80.963 25.506 1.00114.60 C
ATOM 8964 O VAL W 8 10.636 80.422 26.442 1.00114.53 O
ATOM 8965 CB VAL W 8 7.643 80.389 25.325 1.00115.67 C
ATOM 8966 CG1 VAL W 8 6.476 80.508 26.249 1.00115.54 C
ATOM 8967 CG2 VAL W 8 7.174 80.526 23.886 1.00115.26 C
ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O
ATOM 4441 CB TYR S 89 2.847 168.812 -13.864 1.00 96.31 C
ATOM 4442 CG TYR S 89 3.887 169.413 -14.756 1.00 98.43 C
ATOM 4443 CD1 TYR S 89 3.515 170.073 -15.932 1.00100.05 C
ATOM 4444 CD2 TYR S 89 5.251 169.308 -14.451 1.00100.50 C
ATOM 4445 CE1 TYR S 89 4.464 170.642 -16.779 1.00100.70 C
ATOM 4446 CE2 TYR S 89 6.219 169.868 -15.298 1.00101.40 C
ATOM 4447 CZ TYR S 89 5.811 170.535 -16.464 1.00100.46 C
ATOM 4448 OH TYR S 89 6.736 171.094 -17.321 1.00100.20 O
ATOM 4449 N LEU S 90 3.944 166.393 -12.414 1.00 94.95 N
ATOM 4450 CA LEU S 90 5.079 165.622 -11.914 1.00 94.44 C
ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N
ATOM 5152 CA LEU W 8 -64.800 210.035 -10.384 1.00116.52 C
ATOM 5153 C LEU W 8 -64.177 208.641 -10.198 1.00116.71 C
ATOM 5154 O LEU W 8 -64.513 207.944 -9.241 1.00116.99 O
ATOM 5155 CB LEU W 8 -65.086 210.682 -9.033 1.00115.76 C
ATOM 5156 CG LEU W 8 -64.274 211.829 -8.478 1.00113.89 C
ATOM 5157 CD1 LEU W 8 -64.528 211.857 -7.006 1.00111.94 C
ATOM 5158 CD2 LEU W 8 -62.828 211.612 -8.739 1.00112.96 C
In principle, column 5 (W in this case, which represents the chain ID) should be identical only in consecutive chunks. However, in files with too many chains, there are not enough letters in the alphabet to assign a unique ID to each chain, and therefore duplicates may occur.
I would like to be able to check whether or not this is the case. In other words, I would like to know if a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind if it changes from W to S; I would like to know if there are two chunks sharing the same chain ID, in this case, if W or S reappears at some point. In fact, this is only a problem if they also share the first and the 6th columns, but I do not want to complicate things too much.
I do not want to print the lines, just to know the name of the file in which the issue occurs and the chain ID (in this case W), in order to solve the problem. In fact, I already know how to solve the problem, but I need to identify the problematic files so I can focus on those rather than repairing files that are already fine.
SOLUTION (thanks to all for your help, and in particular to sehe):
for pdb in *.pdb ; do
    hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
    [ "$hit" ] && echo $pdb = $hit
done
For this particular sample:
cut -c22-23 t | uniq | sort | uniq -dc
Will output
2 W
(the 22nd column contains 2 runs of the letter 'W')
Untested:
awk '
    seen[$5] && $5 != current {
        print "found non-consecutive chain on line " NR
        exit
    }
    { current = $5; seen[$5] = 1 }
' filename
Here you go, this awk script is tested and takes into account not just 'W':
{
    if (ln[$5] && ln[$5] + 1 != NR) {
        print "dup " $5 " at line " NR;
    }
    ln[$5] = NR;
}
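One caveat with the $5-based awk scripts: whitespace splitting can break when PDB columns run together (as with 1.00115.78 in the sample), whereas the chain ID always sits in column 22. Here is a small Python sketch of the same check using that fixed column. The function name repeated_chain_ids is invented for this example, and the sample records are shortened to the columns that matter.

```python
from itertools import groupby


def repeated_chain_ids(pdb_lines):
    """Return chain IDs that appear in more than one separate run of ATOM records."""
    chain_ids = (
        line[21] for line in pdb_lines if line.startswith("ATOM") and len(line) > 21
    )
    # Collapse consecutive identical chain IDs into single runs, like `uniq`
    runs = [chain for chain, _ in groupby(chain_ids)]
    seen, repeated = set(), []
    for chain in runs:
        if chain in seen and chain not in repeated:
            repeated.append(chain)
        seen.add(chain)
    return repeated


def sample(chains):
    """Build shortened ATOM records with the chain ID in column 22."""
    return [f"{'ATOM':<6}{i:>5}  CA  VAL {c}{8:>4}" for i, c in enumerate(chains, 1)]


print(repeated_chain_ids(sample("WWSSW")))  # ['W']
print(repeated_chain_ids(sample("WWSS")))   # []
```

To scan a directory you could loop over pathlib.Path(".").glob("*.pdb"), pass each open file to the function and print the file name whenever the returned list is non-empty.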
