Sorting a file of lists created by Python with write

I have a file created by python3 using:
of.write("{:<6f} {:<10f} {:<18f} {:<10f}\n"
         .format(betah, test, torque * 13605.698066, mom))
The output file looks like:
$ cat pout
15.0 47.13 0.0594315908872 0.933333333334
25.0 29.07 0.143582198404 0.96
20.0 35.95 0.220373446813 0.95
5.0 124.12 0.230837577743 0.800090803982
4.0 146.71 0.239706979471 0.750671150402
0.5 263.24 0.239785533064 0.163953413739
1.0 250.20 0.240498520899 0.313035285499
Now I want to sort the file by its first column, in descending order.
The expected output of the sort is:
25.0 29.07 0.143582198404 0.96
20.0 35.95 0.220373446813 0.95
15.0 47.13 0.0594315908872 0.933333333334
5.0 124.12 0.230837577743 0.800090803982
4.0 146.71 0.239706979471 0.750671150402
1.0 250.20 0.240498520899 0.313035285499
0.5 263.24 0.239785533064 0.163953413739
I tried this and the tuples example in this, but they yield the output as:
['0.500000 263.240000 0.239786 0.163953 \n', '15.000000 47.130000 0.059432 0.933333 \n', '1.000000 250.200000 0.240499 0.313035 \n', '25.000000 29.070000 0.143582 0.960000 \n', '20.000000 35.950000 0.220373 0.950000 \n', '4.000000 146.710000 0.239707 0.750671 \n', '5.000000 124.120000 0.230838 0.800091 \n']
Please don't try to match the numbers between the input and the output; both are truncated for brevity.
My own attempt at the sorting, with help from 1, looks like this:
f = open("tmp", "r")
lines = [line for line in f if line.strip()]
print(lines)
f.close()
Kindly help me sort the file properly.

The problem you've found is that the strings are sorted alphabetically instead of numerically. What you need to do is convert each item from a string to a float, sort the list of floats, and then output as a string again.
I've recreated your file here, so you can see that I'm reading directly from a file.
pout = [
    "15.0 47.13 0.0594315908872 0.933333333334",
    "25.0 29.07 0.143582198404 0.96",
    "20.0 35.95 0.220373446813 0.95",
    "5.0 124.12 0.230837577743 0.800090803982",
    "4.0 146.71 0.239706979471 0.750671150402",
    "0.5 263.24 0.239785533064 0.163953413739",
    "1.0 250.20 0.240498520899 0.313035285499",
]
with open('test.txt', 'w') as thefile:
    for item in pout:
        thefile.write("{}\n".format(item))
# Read in the file, stripping each line
lines = [line.strip() for line in open('test.txt')]
acc = []
# Loop through the list of lines, splitting the numbers at the whitespace
for strings in lines:
    words = strings.split()
    # Convert each item to a float
    words = [float(word) for word in words]
    acc.append(words)
# Sort the new list, reversing because you want the highest numbers first
lines = sorted(acc, reverse=True)
# Save it to the file.
with open('test.txt', 'w') as thefile:
    for item in lines:
        thefile.write("{:<6} {:<10} {:<18} {:<10}\n".format(item[0], item[1], item[2], item[3]))
Also note that I use with open('test.txt', 'w') as thefile: as it automatically handles closing the file, even if an error occurs. Much safer than pairing open() and close() by hand.
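If all you need is the file ordered by its first column, a more compact variant is possible. A sketch, assuming the same four-column layout and file name as above:

# Sort by the first column only, highest values first.
with open('test.txt') as f:
    rows = [line.split() for line in f if line.strip()]

rows.sort(key=lambda row: float(row[0]), reverse=True)

with open('test.txt', 'w') as f:
    for row in rows:
        f.write("{:<6} {:<10} {:<18} {:<10}\n".format(*row))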

Related

How to build an empirical codon substitution matrix from a multiple sequence alignment

I have been trying to build an empirical codon substitution matrix given a multiple sequence alignment in fasta format using Biopython.
It appears to be relatively straightforward for single-nucleotide substitution matrices using the AlignInfo module when the aligned sequences have the same length. Here is what I managed to do using Python 2.7:
#!/usr/bin/env python
import os
import argparse
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio import SubsMat
import sys

version = "0.0.1 (23.04.20)"
name = "Aln2SubMatrix.py"

parser = argparse.ArgumentParser(description="Outputs a codon substitution matrix given a multi-alignment in FastaFormat. Will raise error if alignments contain dots (\".\"), so replace those with dashes (\"-\") beforehand (e.g. using sed)")
parser.add_argument('-i', '--input', action="store", dest="input", required=True, help="(aligned) input fasta")
parser.add_argument('-o', '--output', action="store", dest="output", help="Output filename (default = <Input-file>.codonSubmatrix")
args = parser.parse_args()

if not args.output:
    args.output = args.input + ".codonSubmatrix"  # if no output name was specified, set it based on the input name

def main():
    infile = open(args.input, "r")
    outfile = open(args.output, "w")
    align = AlignIO.read(infile, "fasta")
    summary_align = AlignInfo.SummaryInfo(align)
    replace_info = summary_align.replacement_dictionary()
    mat = SubsMat.SeqMat(replace_info)
    print >> outfile, mat
    infile.close()
    outfile.close()
    sys.stderr.write("\nfinished\n")

main()
Using a multiple sequence alignment file in fasta format with sequences of the same length (aln.fa), the output is a half-matrix corresponding to the number of nucleotide substitutions observed in the alignment (note that gaps (-) are allowed):
python Aln2SubMatrix.py -i aln.fa
- 0
a 860 232
c 596 75 129
g 571 186 75 173
t 892 58 146 59 141
- a c g t
What I am aiming to do is compute a similar empirical substitution matrix, but for all nucleotide triplets (codons) present in a multiple sequence alignment.
I have tried to tweak the _pair_replacement function of the AlignInfo module to accept nucleotide triplets, by changing lines 305 to 308 from:
for residue_num in range(len(seq1)):
    residue1 = seq1[residue_num]
    try:
        residue2 = seq2[residue_num]
to
for residue_num in range(0, len(seq1), 3):
    residue1 = seq1[residue_num:residue_num + 3]
    try:
        residue2 = seq2[residue_num:residue_num + 3]
At this stage it can retrieve the codons from the alignment, but it complains about the alphabet (does the module only accept single-character alphabets?).
Note that (i) I would like to get a substitution matrix that accounts for the three possible reading frames; a sketch of the kind of counting I have in mind follows below.
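Bypassing AlignInfo and SubsMat entirely, a rough sketch of that counting (my own untested idea, not Biopython API; only AlignIO.read is real Biopython) could look like:

from collections import Counter
from itertools import combinations
from Bio import AlignIO

def codon_replacements(aln_path, frame=0):
    """Tally codon-pair replacements between all pairs of aligned
    sequences, reading triplets from the given frame offset (0, 1 or 2)."""
    align = AlignIO.read(aln_path, "fasta")
    counts = Counter()
    for rec1, rec2 in combinations(align, 2):
        s1, s2 = str(rec1.seq), str(rec2.seq)
        for i in range(frame, len(s1) - 2, 3):
            counts[(s1[i:i + 3], s2[i:i + 3])] += 1
    return counts

# Counters add element-wise, so the three reading frames can be combined:
# total = sum((codon_replacements("aln.fa", f) for f in range(3)), Counter())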
Any help is highly appreciated.

Python: Can I grab the specific lines from a large file faster?

I have two large files. One of them is an info file (about 270 MB and 16,000,000 lines) like this:
1101:10003:17729
1101:10003:19979
1101:10003:23319
1101:10003:24972
1101:10003:2539
1101:10003:28242
1101:10003:28804
The other is in standard FASTQ format (about 27 GB and 280,000,000 lines) like this:
#ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
#ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641:N:0:5
TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifier. For each sequence, this part of Line 1 is unique:
1101:1416:1801 and 1101:10003:75641
I want to grab Line 1 and the next three lines from the FASTQ file according to the info file. Here is my code:
import gzip
import re

count = 0
with open('info_path') as info, open('grab_path', 'w') as grab:
    for i in info:
        sample = i.strip()
        with gzip.open('fq_path') as fq:
            for j in fq:
                count += 1
                if count % 4 == 1:
                    line = j.strip()
                    m = re.search(sample, j)
                    if m != None:
                        grab.writelines(line + '\n' + fq.next() + fq.next() + fq.next())
                        count = 0
                        break
And it works, but because both of these files have millions of lines it's inefficient (running for one day only produced 20,000 lines).
UPDATE on July 6th:
I found that the info file can be read into memory (thanks to #tobias_k for reminding me), so I created a dictionary whose keys are the info lines and whose values are all 0. After that, I read the FASTQ file four lines at a time, use the identifier part of the first line as the key, and if the value is 0 I write out those four lines. Here is my code:
import gzip

dic = {}
with open('info_path') as info:
    for i in info:
        sample = i.strip()
        dic[sample] = 0
with gzip.open('fq_path') as fq, open('grab_path', "w") as grab:
    for j in fq:
        if j[:10] == '#ST-E00126':
            line = j.split(':')
            match = line[4] + ':' + line[5] + ':' + line[6][:-2]
            if dic.get(match) == 0:
                grab.writelines(j + fq.next() + fq.next() + fq.next())
This way is much faster; it took 20 minutes to get all the matched lines (about 64,000,000 lines). I have also thought about sorting the FASTQ file first with an external sort. Splitting the file into chunks that fit in memory is fine; my trouble is how to keep the next three lines attached to the identifier line while sorting. Google's answer is to linearize these four lines first (see the sketch below), but doing so takes 40 minutes.
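For illustration, a minimal sketch of that linearization step (file names reused from above; 'linear_path' is a placeholder):

import gzip
from itertools import islice

# Join each 4-line FASTQ record into one tab-separated line so an
# external sort can move the whole record as a single unit.
with gzip.open('fq_path', 'rt') as fq, open('linear_path', 'w') as out:
    while True:
        record = list(islice(fq, 4))
        if not record:
            break
        out.write('\t'.join(line.rstrip('\n') for line in record) + '\n')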
Anyway thanks for your help.
You can sort both files by the identifier part (the 1101:1416:1801). Even if the files do not fit into memory, you can use external sorting.
After this, you can apply a simple merge-like strategy: read both files together and do the matching in the meantime. Something like this (pseudocode):
entry1 = readFromFile1()
entry2 = readFromFile2()
while (none of the files ended)
    if (entry1.id == entry2.id)
        record match
        entry1 = readFromFile1()
        entry2 = readFromFile2()
    else if (entry1.id < entry2.id)
        entry1 = readFromFile1()
    else
        entry2 = readFromFile2()
This way entry1.id and entry2.id stay close to each other and you will not miss any matches, while each file is iterated over only once.
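A rough Python rendering of that merge (the extract_id helper is hypothetical; substitute whatever pulls the 1101:1416:1801 part out of a line):

def merge_matches(file1, file2, extract_id):
    # Walk two id-sorted line iterators in lockstep, yielding pairs
    # of entries whose identifiers match.
    entry1, entry2 = next(file1, None), next(file2, None)
    while entry1 is not None and entry2 is not None:
        id1, id2 = extract_id(entry1), extract_id(entry2)
        if id1 == id2:
            yield entry1, entry2
            entry1, entry2 = next(file1, None), next(file2, None)
        elif id1 < id2:
            entry1 = next(file1, None)
        else:
            entry2 = next(file2, None)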

Ruby - How to subtract numbers from two files and save the result in one of them at a specified position?

I have 2 txt files with different strings and numbers in them, split with ;.
Now I need to compute:
((number at position 2 in file1) - (number at position 25 in file2)) = result
Then I want to replace the number at position 2 in file1 with the result.
I tried the code below, but it only appends a number at the end of the file, and what gets appended is not the result of the calculation.
def calc
  f1 = File.open("./file1.txt", File::RDWR)
  f2 = File.open("./file2.txt", File::RDWR)
  f1.flock(File::LOCK_EX)
  f2.flock(File::LOCK_EX)
  f1.each.zip(f2.each).each do |line, line2|
    bg = line.split(";").compact.collect(&:strip)
    bd = line2.split(";").compact.collect(&:strip)
    n = bd[2].to_i - bg[25].to_i
    f2.print bd[2] << n
    # puts "#{n}"  # Only for testing
  end
  f1.flock(File::LOCK_UN)
  f2.flock(File::LOCK_UN)
  f1.close && f2.close
end
Use something like this:
lines1 = File.readlines('file1.txt').map(&:to_i)
lines2 = File.readlines('file2.txt').map(&:to_i)
result = lines1.zip(lines2).map { |value1, value2| value1 - value2 }
File.write('file1.txt', result.join(?\n))
This code loads both files into memory, then calculates the result and writes it to the first file.
FYI: if you want to keep your own approach, just save the result to another file (e.g. result.txt) and copy it over the original file at the end.

Error in writing to a file

I have written a Python script that calls Unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:
import subprocess

sort_bt = open("sort_blast.txt", 'w+')
sort_file_cmd = "sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd, stdout=sort_bt, shell=True)
The output file, however, contains an incomplete line, which produces an error when I parse the table; yet when I checked the corresponding entry in the input file given to sort, the line looks perfect. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In the output file, however, only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly considering that Python has a built-in method for sorting items.
Looking at your sample data, it appears to be structured data with a | delimiter. Here's how you could open that file and iterate over the results in Python in a sorted manner:
def custom_sorter(first, second):
    """A custom sort function which compares items
    based on the values in the 2nd and 6th columns."""
    # First, we break the line into a list
    first_items, second_items = first.split(u'|'), second.split(u'|')  # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else:  # They are the same
            return 0  # Order doesn't matter then
    else:
        return 0

with open(src_file_path, 'r') as src_file:
    data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!
with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    for line in sorted(data.splitlines(), cmp=custom_sorter):  # Sort the data on the fly
        dst_sorted_file.write(line + '\n')  # Write the line to the dst_file.
FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate the sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call

with open("sort_blast.txt", 'wb') as output_file:
    check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can also write it in pure Python, e.g., for a small input file:
def custom_key(line):
    fields = line.split()  # split line on any whitespace
    return fields[1], float(fields[5])  # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
    L = input_file.read().splitlines()  # read from the input file
    L.sort(key=custom_key)  # sort it
    output_file.write("\n".join(L))  # write to the output file
If you need to sort a file that does not fit in memory, see Sorting text file by using Python.

How can I find all initialisms in a text?

I have to find all initialisms (capital-letter words, such as SAP, JSON or XML) in my plain text files. Is there any ready-made script for this? Ruby, Python, Perl - the language doesn't matter. So far, I've found nothing.
Regards,
Stefan
Here you go:
perl -e 'for (<>) { for (m/\b([[:upper:]]{2,})\b/g) { print "$_\n"; } }' textinput.txt
Grabs all all-uppercase words that are at least two characters long (the /g flag is needed to catch more than one per line). I use [[:upper:]] instead of [A-Z] so that it works in any locale.
A simpler version of Conspicuous Compiler's answer uses the -n flag to cut out all that ugly loop code:
perl -ne 'print "$_\n" for m/\b([[:upper:]]{2,})\b/g' input.txt
A regular expression like /[A-Z]{2,}/ should do the trick.
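For instance, a quick Python illustration of that regex:

import re

text = "We exported the SAP data as JSON and XML."
print(re.findall(r'\b[A-Z]{2,}\b', text))  # prints ['SAP', 'JSON', 'XML']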
Here's a Python 2.x solution that allows for digits (see example). Update: Code now works for Python 3.1, 3.0 and 2.1 to 2.6 inclusive.
dos-prompt>type find_acronyms.py
import re

try:
    set
except NameError:
    try:
        from sets import Set as set  # Python 2.3
    except ImportError:
        class set:  # Python 2.2 and earlier
            # VERY minimal implementation
            def __init__(self):
                self.d = {}
            def add(self, element):
                self.d[element] = None
            def __str__(self):
                return 'set(%s)' % self.d.keys()

word_regex = re.compile(r"\w{2,}", re.LOCALE)  # min length is 2 characters

def accumulate_acronyms(a_set, an_iterable):
    # updates a_set in situ
    for line in an_iterable:
        for word in word_regex.findall(line):
            if word.isupper() and "_" not in word:
                a_set.add(word)

test_data = """
A BB CCC _DD EE_ a bb ccc k9 K9 A1
It's a CHARLIE FOXTROT, said MAJ Major Major USAAF RETD.
FBI CIA MI5 MI6 SDECE OGPU NKVD KGB FSB
BB CCC # duplicates
_ABC_DEF_GHI_ 123 666 # no acronyms here
"""

result = set()
accumulate_acronyms(result, test_data.splitlines())
print(result)
dos-prompt>\python26\python find_acronyms.py
set(['CIA', 'OGPU', 'BB', 'RETD', 'CHARLIE', 'FSB',
'NKVD', 'A1', 'SDECE', 'KGB', 'MI6', 'USAAF', 'K9', 'MAJ',
'MI5', 'FBI', 'CCC', 'FOXTROT'])
# Above output has had newlines inserted for ease of reading.
# Output from 3.0 & 3.1 differs slightly in presentation.
# Output from 2.1 differs in item order.
