How to build an empirical codon substitution matrix from a multiple sequence alignment

I have been trying to build an empirical codon substitution matrix given a multiple sequence alignment in fasta format using Biopython.
It appears to be relatively straightforward for single-nucleotide substitution matrices using the AlignInfo module when the aligned sequences have the same length. Here is what I managed to do using Python 2.7:
#!/usr/bin/env python
import os
import argparse
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio import SubsMat
import sys
version = "0.0.1 (23.04.20)"
name = "Aln2SubMatrix.py"
parser=argparse.ArgumentParser(description="Outputs a codon substitution matrix given a multi-alignment in FastaFormat. Will raise error if alignments contain dots (\".\"), so replace those with dashes (\"-\") beforehand (e.g. using sed)")
parser.add_argument('-i','--input', action = "store", dest = "input", required = True, help = "(aligned) input fasta")
parser.add_argument('-o','--output', action = "store", dest = "output", help = "Output filename (default = <Input-file>.codonSubmatrix")
args=parser.parse_args()
if not args.output:
    args.output = args.input + ".codonSubmatrix" #if no outputname was specified set outputname based on inputname
def main():
    infile = open(args.input, "r")
    outfile = open(args.output, "w")
    align = AlignIO.read(infile, "fasta")
    summary_align = AlignInfo.SummaryInfo(align)
    replace_info = summary_align.replacement_dictionary()
    mat = SubsMat.SeqMat(replace_info)
    print >> outfile, mat
    infile.close()
    outfile.close()
    sys.stderr.write("\nfinished\n")
main()
Using a multiple sequence alignment file in FASTA format with sequences of the same length (aln.fa), the output is a half-matrix giving the number of nucleotide substitutions observed in the alignment (note that gaps (-) are allowed):
python Aln2SubMatrix.py -i aln.fa
- 0
a 860 232
c 596 75 129
g 571 186 75 173
t 892 58 146 59 141
- a c g t
What I am aiming to do is compute a similar empirical substitution matrix, but for all nucleotide triplets (codons) present in a multiple sequence alignment.
I have tried to tweak the _pair_replacement function of the AlignInfo module in order to accept nucleotide triplets by changing:
line 305 to 308
for residue_num in range(len(seq1)):
    residue1 = seq1[residue_num]
    try:
        residue2 = seq2[residue_num]
to
for residue_num in range(0, len(seq1), 3):
    residue1 = seq1[residue_num:residue_num+3]
    try:
        residue2 = seq2[residue_num:residue_num+3]
At this stage it can retrieve the codons from the alignment but complains about the alphabet (the module only accepts single character alphabet?).
Note that
(i) I would like to get a substitution matrix that accounts for the three possible reading frames
Any help is highly appreciated.
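One way to sidestep the alphabet restriction is to count codon pairs directly, without going through AlignInfo. The following is only a minimal sketch under some assumptions: the aligned sequences are already available as plain strings (for example via str(record.seq)), every pair of sequences is compared, and codons containing a gap are skipped. The function name codon_replacements and the toy alignment are made up for illustration and are not part of Biopython.

```python
from collections import Counter
from itertools import combinations

def codon_replacements(seqs, frame=0):
    """Count codon pairs seen at the same alignment position.

    seqs  -- aligned sequences as plain strings (assumed equal length)
    frame -- reading frame offset: 0, 1 or 2
    Codons containing a gap are skipped (a design choice of this sketch).
    """
    counts = Counter()
    for s1, s2 in combinations(seqs, 2):
        length = min(len(s1), len(s2))
        for i in range(frame, length - 2, 3):
            c1 = s1[i:i + 3].lower()
            c2 = s2[i:i + 3].lower()
            if "-" in c1 or "-" in c2:
                continue
            # Store each pair in sorted order so (a, b) and (b, a) share a cell.
            counts[tuple(sorted((c1, c2)))] += 1
    return counts

# Toy alignment, purely illustrative.
alignment = ["atggcc", "atggct", "ttggcc"]
per_frame = {frame: codon_replacements(alignment, frame) for frame in range(3)}
for frame, counts in per_frame.items():
    print(frame, dict(counts))
```

Running it once per frame (0, 1, 2) gives three half-matrices; summing the three Counter objects would give a single matrix covering all reading frames, if that is what is wanted.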

Related

Check if a value is between certain values in row of a file

I want to read a file and extract only those lines that contain a number within a given range in the fourth column.
For example, in this line I would like to know if 5240 is between 5220 and 5240.
MTB_anc RefSeq CDS 5240 7267 . + 0 ID=cds4;Parent=gene4;Dbxref=Genbank:NP_214519.2,GeneID:887081;Name=NP_214519.2;Note=Belongs to the type II topoisomerase family.;gbkey=CDS;gene=gyrB;product=DNA gyrase subunit B;protein_id=NP_214519.2;transl_table=11
I guess I should make a list with each element of the line and index that position but I don't get how to search an int in a string.
I am using Python 2.
Your approach is good. You are almost there.
An error can be caught when converting a string to an integer:
lines = []
with open(fname) as fp:
    for line in fp:
        tokens = line.split('\t')
        try:
            value = int(tokens[3])
            if 5220 <= value <= 5240:
                lines.append(tokens)
        except ValueError as err:
            continue
But it is also possible to test the content beforehand:
import re
lines = []
with open(fname) as fp:
    for line in fp:
        tokens = line.split('\t')
        if re.match(r'^\d+$', tokens[3]) and 5220 <= int(tokens[3]) <= 5240:
            lines.append(tokens)
Which of the two is more suitable depends on which values the columns can contain.
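If the file is GFF-like, as the sample line suggests, a slightly more defensive variant could look like the sketch below. It assumes tab-separated columns, comment lines starting with "#", and that short lines should simply be skipped. The function name rows_in_range and the second data row are hypothetical, added only for illustration.

```python
def rows_in_range(lines, low, high, column=3):
    """Return the split rows whose integer value in `column` lies in [low, high]."""
    selected = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment / header lines
        tokens = line.rstrip("\n").split("\t")
        if len(tokens) <= column:
            continue  # line too short to have the requested column
        try:
            value = int(tokens[column])
        except ValueError:
            continue  # column does not hold an integer
        if low <= value <= high:
            selected.append(tokens)
    return selected

# First data row is taken from the question; the second is a made-up example.
sample = [
    "##gff-version 3\n",
    "MTB_anc\tRefSeq\tCDS\t5240\t7267\t.\t+\t0\tID=cds4\n",
    "MTB_anc\tRefSeq\tCDS\t7302\t9818\t.\t+\t0\tID=cds5\n",
]
hits = rows_in_range(sample, 5220, 5240)
print(hits)
```

With a real file, the same function can be called as rows_in_range(open(fname), 5220, 5240), since a file object yields lines.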

Steganography program - converting python 2 to 3, syntax error in: base64.b64decode("".join(chars))

I have problem with the syntax in the last part of steg program. I tried to convert python 2 version (of the working code) to python 3, and this is the last part of it:
flag = base64.b64decode("".join(chars)) <- error
print(flag)
The program (1) hides the message in the least significant bits of the image and saves it as a new image, then (2) recovers the message, which is stored in "flag", and prints it.
* can the error be caused by the wrong type of input?:
message = input("Your message: ")
BELOW: UNHIDING PROGRAM
#coding: utf-8
import base64
from PIL import Image
image = Image.open("after.png")
extracted = ''
pixels = image.load()
#Iterating in 1st row
for x in range(0,image.width):
    r,g,b = pixels[x,0]
    # Storing LSB of each color
    extracted += bin(r)[-1]
    extracted += bin(g)[-1]
    extracted += bin(b)[-1]
chars = []
for i in range(len(extracted)/8):
    byte = extracted[i*8:(i+1)*8]
    chars.append(chr(int(''.join([str(bit) for bit in byte]), 2)))
flag = base64.b64decode(''.join(chars))
print flag
BELOW: HIDING PROGRAM:
import bitarray
import base64
from PIL import Image
with Image.open('before.png') as im:
    pixels=im.load()
message = input("Your message: ")
encoded_message = base64.b64encode(message.encode('utf-8'))
#Convert the message into an array of bits
ba = bitarray.bitarray()
ba.frombytes(encoded_message)
bit_array = [int(i) for i in ba]
#Duplicate the original picture
im = Image.open("before.png")
im.save("after.png")
im = Image.open("after.png")
width, height = im.size
pixels = im.load()
#Hide message in the first row
i = 0
for x in range(0,width):
    r,g,b = pixels[x,0]
    #print("[+] Pixel : [%d,%d]"%(x,0))
    #print("[+] \tBefore : (%d,%d,%d)"%(r,g,b))
    #Default values in case no bit has to be modified
    new_bit_red_pixel = 255
    new_bit_green_pixel = 255
    new_bit_blue_pixel = 255
    if i<len(bit_array):
        #Red pixel
        r_bit = bin(r)
        r_last_bit = int(r_bit[-1])
        r_new_last_bit = r_last_bit & bit_array[i]
        new_bit_red_pixel = int(r_bit[:-1]+str(r_new_last_bit),2)
        i += 1
    if i<len(bit_array):
        #Green pixel
        g_bit = bin(g)
        g_last_bit = int(g_bit[-1])
        g_new_last_bit = g_last_bit & bit_array[i]
        new_bit_green_pixel = int(g_bit[:-1]+str(g_new_last_bit),2)
        i += 1
    if i<len(bit_array):
        #Blue pixel
        b_bit = bin(b)
        b_last_bit = int(b_bit[-1])
        b_new_last_bit = b_last_bit & bit_array[i]
        new_bit_blue_pixel = int(b_bit[:-1]+str(b_new_last_bit),2)
        i += 1
    pixels[x,0] = (new_bit_red_pixel,new_bit_green_pixel,new_bit_blue_pixel)
    #print("[+] \tAfter: (%d,%d,%d)"%(new_bit_red_pixel,new_bit_green_pixel,new_bit_blue_pixel))
im.save('after.png')
error
ValueError: string argument should contain only ASCII characters
help for base64.b64decode says:
b64decode(s, altchars=None, validate=False)
Decode the Base64 encoded bytes-like object or ASCII string s.
...
Considering that Python 2 had both "normal" strs and unicode strs (u-prefixed), I suggest taking a closer look at what "".join(chars) produces. Does it contain solely ASCII characters?
I suggest adding:
print("Codes:",[ord(c) for c in chars])
directly before:
flag = base64.b64decode("".join(chars))
If any of the codes is greater than 127, the call will fail, because when given a str, b64decode accepts only pure ASCII.
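In Python 3 it may be simpler to avoid str altogether and hand b64decode a bytes object built from the extracted bits. The sketch below is not the original program: it assumes the bits are held in a string of "0"/"1" characters (like extracted in the question), and bits_to_bytes is a hypothetical helper name. Note also that range(len(extracted)/8) needs // in Python 3, because / returns a float there.

```python
import base64

def bits_to_bytes(bits):
    """Pack a string of '0'/'1' characters into bytes, 8 bits per byte.

    Trailing bits that do not fill a whole byte are ignored.
    """
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

# Round trip without any image, just to show the types involved.
encoded = base64.b64encode("secret".encode("utf-8"))      # bytes
bits = "".join(format(byte, "08b") for byte in encoded)   # str of 0/1
flag = base64.b64decode(bits_to_bytes(bits))              # bytes in, bytes out
print(flag.decode("utf-8"))
```

Because the input to b64decode is bytes here, byte values above 127 no longer trigger the "only ASCII characters" error that a str argument does.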

Get unicode block element based on matrix

A unique question, I guess. Given these Unicode block elements:
https://en.wikipedia.org/wiki/Block_Elements
I want to get the relevant block element based on the matrix I get, so
11
01 will give ▜
00
10 will give ▖
and so on
I managed to do this in python, but I wonder if anyone got a more elegant solution.
from itertools import product
elements = [0, 1]
a = product(elements, repeat=2)
b = product(a, repeat=2)
matrices = [c for c in b]
"""
Generated matrix possibilities
00 00 00 00 01 01 01 01 10 10 10 10 11 11 11 11
00 01 10 11 00 01 10 11 00 01 10 11 00 01 10 11
"""
blocks = [' ', '▗', '▖', '▄', '▝', '▐', '▞', '▟', '▘', '▚', '▌', '▙', '▀', '▜', '▛', '█']
given = (
(0,1),
(1,0)
)
print(blocks[matrices.index(given)])
output: ▞
These characters, although existing, were not meant to have a direct correlation
of numbers-to-set-1/4 blocks.
So, I have a solution in a published package, and it is not necessarily
more "elegant" than yours, as it is far more verbose.
However, the code around it allows one to "draw" on a text terminal
using these 1/4 blocks as pixels, in a somewhat clean API.
So, this is the class I use to set/reset pixels in a character block. The relevant methods can be used straight from the class; they take the "pixel coordinates" and the current character block upon which to set or reset the addressed pixel. The code instantiates the class just to be able to use the in operator to check for block characters.
The project can be installed with "pip install terminedia".
The function and class below, extracted from the project, work standalone to do the same as your code:
# Snippets from jsbueno/terminedia, v. 0.2.0
def _mirror_dict(dct):
    """Creates a new dictionary exchanging values for keys
    Args:
      - dct (mapping): Dictionary to be inverted
    """
    return {value: key for key, value in dct.items()}

class BlockChars_:
    """Used internally to emulate pixel setting/resetting/reading inside 1/4 block characters
    Contains a listing and other mappings of all block characters used in order, so that
    bits in numbers from 0 to 15 will match the "pixels" on the corresponding block character.
    Although this class is purposed for internal use in the emulation of
    a higher resolution canvas, its functions can be used by any application
    that decides to manipulate block chars.
    The class itself is stateless, and it is used as a single-instance which
    uses the name :any:`BlockChars`. The instance is needed so that one can use the operator
    ``in`` to check if a character is a block-character.
    """
    EMPTY = " "
    QUADRANT_UPPER_LEFT = '\u2598'
    QUADRANT_UPPER_RIGHT = '\u259D'
    UPPER_HALF_BLOCK = '\u2580'
    QUADRANT_LOWER_LEFT = '\u2596'
    LEFT_HALF_BLOCK = '\u258C'
    QUADRANT_UPPER_RIGHT_AND_LOWER_LEFT = '\u259E'
    QUADRANT_UPPER_LEFT_AND_UPPER_RIGHT_AND_LOWER_LEFT = '\u259B'
    QUADRANT_LOWER_RIGHT = '\u2597'
    QUADRANT_UPPER_LEFT_AND_LOWER_RIGHT = '\u259A'
    RIGHT_HALF_BLOCK = '\u2590'
    QUADRANT_UPPER_LEFT_AND_UPPER_RIGHT_AND_LOWER_RIGHT = '\u259C'
    LOWER_HALF_BLOCK = '\u2584'
    QUADRANT_UPPER_LEFT_AND_LOWER_LEFT_AND_LOWER_RIGHT = '\u2599'
    QUADRANT_UPPER_RIGHT_AND_LOWER_LEFT_AND_LOWER_RIGHT = '\u259F'
    FULL_BLOCK = '\u2588'
    # This depends on Python 3.6+ ordered behavior for local namespaces and dicts:
    block_chars_by_name = {key: value for key, value in locals().items() if key.isupper()}
    block_chars_to_name = _mirror_dict(block_chars_by_name)
    blocks_in_order = {i: value for i, value in enumerate(block_chars_by_name.values())}
    block_to_order = _mirror_dict(blocks_in_order)

    def __contains__(self, char):
        """True if a char is a "pixel representing" block char"""
        return char in self.block_chars_to_name

    @classmethod
    def _op(cls, pos, data, operation):
        number = cls.block_to_order[data]
        index = 2 ** (pos[0] + 2 * pos[1])
        return operation(number, index)

    @classmethod
    def set(cls, pos, data):
        """"Sets" a pixel in a block character
        Args:
          - pos (2-sequence): coordinate of the pixel inside the character
            (0,0) is top-left corner, (1,1) bottom-right corner and so on)
          - data: initial character to be composed with the bit to be set. Use
            space ("\x20") to start with an empty block.
        """
        op = lambda n, index: n | index
        return cls.blocks_in_order[cls._op(pos, data, op)]

    @classmethod
    def reset(cls, pos, data):
        """"resets" a pixel in a block character
        Args:
          - pos (2-sequence): coordinate of the pixel inside the character
            (0,0) is top-left corner, (1,1) bottom-right corner and so on)
          - data: initial character to be composed with the bit to be reset.
        """
        op = lambda n, index: n & (0xf - index)
        return cls.blocks_in_order[cls._op(pos, data, op)]

    @classmethod
    def get_at(cls, pos, data):
        """Retrieves whether a pixel in a block character is set
        Args:
          - pos (2-sequence): The pixel coordinate
          - data (character): The character where to look at blocks.
        Raises KeyError if an invalid character is passed in "data".
        """
        op = lambda n, index: bool(n & index)
        return cls._op(pos, data, op)

#: :any:`BlockChars_` single instance: enables ``__contains__``:
BlockChars = BlockChars_()
After pasting only this in the terminal it is possible to do:
In [131]: pixels = BlockChars.set((0,0), " ")
In [132]: print(BlockChars.set((1,1), pixels))
# And this internal "side-product" is closer to what you have posted:
In [133]: BlockChars.blocks_in_order[0b1111]
Out[133]: '█'
In [134]: BlockChars.blocks_in_order[0b1010]
Out[134]: '▐'
The project at https://github.com/jsbueno/terminedia has a complete
drawing API to use these as pixels in an ANSI text terminal,
including Bézier curves, filled ellipses, and RGB image display
(check the "examples" folder).
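For the plain lookup asked about in the question, a more compact alternative is to treat the four cells as the bits of a number and index into a 16-character string. This is only a sketch: the names BLOCKS and block_for are made up here, and the bit order (top-left is the highest bit, bottom-right the lowest) is chosen to match the row-by-row matrices in the question rather than the order used by terminedia.

```python
# Index = top_left*8 + top_right*4 + bottom_left*2 + bottom_right
BLOCKS = " ▗▖▄▝▐▞▟▘▚▌▙▀▜▛█"

def block_for(matrix):
    """Map a 2x2 matrix of 0/1 values to the matching quadrant block character."""
    (top_left, top_right), (bottom_left, bottom_right) = matrix
    index = top_left << 3 | top_right << 2 | bottom_left << 1 | bottom_right
    return BLOCKS[index]

print(block_for(((0, 1), (1, 0))))
print(block_for(((1, 1), (0, 1))))
print(block_for(((0, 0), (1, 0))))
```

This avoids building the list of all matrices and the linear search done by matrices.index(given).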

Extract multiple protein sequences from a Protein Data Bank along with Secondary Structure

I want to extract protein sequences and their corresponding secondary structure from any Protein Data bank, say RCSB. I just need short sequences and their secondary structure. Something like,
ATRWGUVT Helix
It is fine even if the sequences are long, but I want a tag at the end that denotes its secondary structure. Is there any programming tool or anything available for this.
As I've shown above I want only this much minimal information. How can I achieve this?
from Bio.PDB import *
from distutils import spawn
import sys
Extract sequence:
def get_seq(pdbfile):
    p = PDBParser(PERMISSIVE=0)
    structure = p.get_structure('test', pdbfile)
    ppb = PPBuilder()
    seq = ''
    for pp in ppb.build_peptides(structure):
        seq += pp.get_sequence()
    return seq
Extract secondary structure with DSSP as explained earlier:
def get_secondary_struc(pdbfile):
    # get secondary structure info for whole pdb.
    if not spawn.find_executable("dssp"):
        sys.stderr.write('dssp executable needs to be in folder')
        sys.exit(1)
    p = PDBParser(PERMISSIVE=0)
    ppb = PPBuilder()
    structure = p.get_structure('test', pdbfile)
    model = structure[0]
    dssp = DSSP(model, pdbfile)
    count = 0
    sec = ''
    for residue in model.get_residues():
        count = count + 1
        # print residue,count
        a_key = list(dssp.keys())[count - 1]
        sec += dssp[a_key][2]
    print(sec)
    return sec
This should print both sequence and secondary structure.
You can use DSSP.
The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is:
H = α-helix
B = residue in isolated β-bridge
E = extended strand, participates in β ladder
G = 3-helix (3₁₀ helix)
I = 5 helix (π-helix)
T = hydrogen bonded turn
S = bend
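To get output in the "ATRWGUVT Helix" style from the question, the sequence and the DSSP string can be cut into runs that share the same code. The sketch below assumes both strings have the same length and are aligned residue by residue (which is not guaranteed if DSSP skips residues). The names DSSP_LABELS and tag_segments, the label wording, and the toy input are illustrative choices, not part of Biopython or DSSP.

```python
from itertools import groupby

# Human-readable labels for the DSSP codes listed above (wording is arbitrary).
DSSP_LABELS = {
    "H": "Helix", "G": "3-10 helix", "I": "Pi helix",
    "E": "Strand", "B": "Bridge", "T": "Turn", "S": "Bend",
}

def tag_segments(seq, sec):
    """Split seq into runs with the same DSSP code; return (subsequence, label) pairs."""
    segments = []
    pos = 0
    for code, run in groupby(sec):
        n = len(list(run))
        # Any code not in the table (e.g. '-') is reported as "Coil".
        segments.append((seq[pos:pos + n], DSSP_LABELS.get(code, "Coil")))
        pos += n
    return segments

# Toy input, purely illustrative.
for subseq, label in tag_segments("ATRWGAVTKL", "HHHHEEEE--"):
    print(subseq, label)
```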

Determining All Possibilities for a Random String?

I was hoping someone with better math skills could help me figure out the total number of possibilities for a string, given its length and character set.
i.e. [a-f0-9]{6}
What are the possibilities for this pattern of random characters?
It is equal to the number of characters in the set raised to 6th power.
In Python (3.x) interpreter:
>>> len("0123456789abcdef")
16
>>> 16**6
16777216
>>>
EDIT 1:
Why 16.7 million? Well, 000000 ... 999999 = 10^6 = 1M, 16/10 = 1.6 and
>>> 1.6**6
16.77721600000000
EDIT 2:
To create a list in Python, do: print(['{0:06x}'.format(i) for i in range(16**6)])
However, this is too huge. Here is a simpler, shorter example:
>>> ['{0:06x}'.format(i) for i in range(100)]
['000000', '000001', '000002', '000003', '000004', '000005', '000006', '000007', '000008', '000009', '00000a', '00000b', '00000c', '00000d', '00000e', '00000f', '000010', '000011', '000012', '000013', '000014', '000015', '000016', '000017', '000018', '000019', '00001a', '00001b', '00001c', '00001d', '00001e', '00001f', '000020', '000021', '000022', '000023', '000024', '000025', '000026', '000027', '000028', '000029', '00002a', '00002b', '00002c', '00002d', '00002e', '00002f', '000030', '000031', '000032', '000033', '000034', '000035', '000036', '000037', '000038', '000039', '00003a', '00003b', '00003c', '00003d', '00003e', '00003f', '000040', '000041', '000042', '000043', '000044', '000045', '000046', '000047', '000048', '000049', '00004a', '00004b', '00004c', '00004d', '00004e', '00004f', '000050', '000051', '000052', '000053', '000054', '000055', '000056', '000057', '000058', '000059', '00005a', '00005b', '00005c', '00005d', '00005e', '00005f', '000060', '000061', '000062', '000063']
>>>
EDIT 3:
As a function:
def generateAllHex(numDigits):
    assert(numDigits > 0)
    ceiling = 16**numDigits
    for i in range(ceiling):
        formatStr = '{0:0' + str(numDigits) + 'x}'
        print(formatStr.format(i))
This will take a while to print at numDigits = 6.
I recommend dumping this to file instead like so:
def generateAllHex(numDigits, fileName):
    assert(numDigits > 0)
    ceiling = 16**numDigits
    with open(fileName, 'w') as fout:
        for i in range(ceiling):
            formatStr = '{0:0' + str(numDigits) + 'x}'
            # one value per line
            fout.write(formatStr.format(i) + '\n')
If you are just looking for the number of possibilities, the answer is (charset.length)^(length). If you need to actually generate a list of the possibilities, just loop through each character, recursively generating the remainder of the string.
e.g.
void generate(char[] charset, int length)
{
    generate("", charset, length);
}
void generate(String prefix, char[] charset, int length)
{
    for (int i = 0; i < charset.length; i++)
    {
        if (length == 1)
            System.out.println(prefix + charset[i]);
        else
            // append the character itself, not its index
            generate(prefix + charset[i], charset, length - 1);
    }
}
The number of possibilities is the size of your alphabet raised to the power of the length of your string (in the general case, of course).
Assuming your string length is 4 (_ _ _ _) and your alphabet is { 0 , 1 }:
there are 2 possibilities (0 or 1) for the first place, 2 for the second place, and so on.
So the choices multiply to: alphabet_size^string_size
first: 000000
last: ffffff
This matches hexadecimal numbers.
For any given set of possible values, the number of possible strings (with repetition allowed) is the number of possible values raised to the power of the number of positions.
In this case, that would be 16 to the 6th power, or 16777216 possibilities.
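The counting rule works for any character set, not just hexadecimal digits. As a small sketch (the names count_strings and all_strings are made up for illustration), itertools.product can enumerate the strings lazily, so nothing huge has to be held in memory:

```python
from itertools import islice, product

def count_strings(charset, length):
    """Number of distinct strings of `length` over the distinct characters in charset."""
    return len(set(charset)) ** length

def all_strings(charset, length):
    """Lazily yield every string of `length` over charset, in product order."""
    for chars in product(charset, repeat=length):
        yield "".join(chars)

HEX = "0123456789abcdef"
print(count_strings(HEX, 6))                  # 16777216
print(list(islice(all_strings(HEX, 6), 3)))   # ['000000', '000001', '000002']
```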
