I have written a Python script that calls Unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:
import subprocess

sort_bt = open("sort_blast.txt", 'w+')
sort_file_cmd = "sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd, stdout=sort_bt, shell=True)
The output file, however, contains an incomplete line, which produces an error when I parse the table; but when I checked the corresponding entry in the input file given to sort, the line looks perfect. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it.
The line looks like this in the input file:
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In the output file, however, only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly, considering that Python has a built-in method for sorting items.
Looking at your sample data, it appears to be structured data with a | delimiter. Here's how you could open that file and iterate over the results in Python in sorted order:
def custom_sorter(first, second):
    """ A custom sort function which compares items
    based on the values in the 2nd and 6th columns. """
    # First, we break each line into a list.
    first_items, second_items = first.split(u'|'), second.split(u'|')  # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare.
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else:  # They are the same
            return 0  # Order doesn't matter then
    else:
        return 0
with open(src_file_path, 'r') as src_file:
    data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!

with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    for line in sorted(data.splitlines(), cmp=custom_sorter):  # Sort the data on the fly
        dst_sorted_file.write(line + '\n')  # splitlines() strips newlines, so add one back.
FYI, this code may need some jiggling. I didn't test it too well.
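One caveat worth noting: the cmp argument to sorted() exists only in Python 2. On Python 2.7+ (and Python 3, where cmp is gone entirely), the same comparator can be wrapped with functools.cmp_to_key; a minimal sketch reusing custom_sorter and the variables above:
from functools import cmp_to_key  # available since Python 2.7

with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    # cmp_to_key adapts the two-argument comparator to the key= interface.
    for line in sorted(data.splitlines(), key=cmp_to_key(custom_sorter)):
        dst_sorted_file.write(line + '\n')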
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate the sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call

with open("sort_blast.txt", 'wb') as output_file:
    check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can also write it in pure Python, e.g., for a small input file:
def custom_key(line):
    fields = line.split()               # split line on any whitespace
    return fields[1], float(fields[5])  # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
    L = input_file.read().splitlines()  # read from the input file
    L.sort(key=custom_key)              # sort it
    output_file.write("\n".join(L))     # write to the output file
If you need to sort a file that does not fit in memory, see Sorting text file by using Python.
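For reference, here is a minimal sketch of that external-sort idea (sort fixed-size chunks in memory, spill each sorted run to a temporary file, then k-way merge the runs). It assumes every input line ends with a newline; custom_key is the function defined above:
import heapq
import tempfile
from itertools import islice

def external_sort(input_path, output_path, key, chunk_size=100000):
    # Phase 1: sort chunk_size-line chunks in memory, spilling each
    # sorted run to its own temporary file.
    runs = []
    with open(input_path) as input_file:
        while True:
            chunk = list(islice(input_file, chunk_size))
            if not chunk:
                break
            chunk.sort(key=key)
            run = tempfile.TemporaryFile(mode='w+')
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Phase 2: k-way merge the runs. Decorating each line with its key
    # lets heapq.merge compare entries without a key= argument (which
    # older Python versions don't support).
    decorated = [((key(line), line) for line in run) for run in runs]
    with open(output_path, 'w') as output_file:
        for _, line in heapq.merge(*decorated):
            output_file.write(line)
So in this case something like external_sort(tab.name, "sort_blast.txt", key=custom_key) should do it.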
Related
I'm new to Python and this site, so thank you in advance for your... understanding. This is my first attempt at a Python script.
I'm having what I think is a performance issue trying to solve this problem, which is causing me to get no data back.
This code works on a small text file of a couple of pages, but when I try to use it on my 35MB real-data text file, it just pegs the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re

with open('testdata.txt') as myfile:
    content = myfile.read()

text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}',
                 content, re.DOTALL).group()

with open("result.txt", "w") as myfile2:
    myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with a timestamp-like line. Try this code:
import re

start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')

capturing = 0
capture = ''

with open('output.txt', 'w') as output_file:
    with open('input.txt', 'r') as input_file:
        for line in input_file:
            if capturing > 0 and capturing <= 4:
                capturing += 1
                capture += line
            elif start_pattern.match(line):
                capturing = 1
                capture = line
            if capturing == 4:
                if end_pattern.match(line):
                    output_file.write(capture)
                # Reset whether or not the 4th line matched, so the
                # next group can start cleanly.
                capturing = 0
It iterates over the input file line by line. When it finds a line matching the start_pattern, it collects the next 3 lines. If the 4th line matches the end_pattern, it writes the whole group to the output file.
I have two large files. One of them is an info file (about 270MB, 16,000,000 lines) like this:
1101:10003:17729
1101:10003:19979
1101:10003:23319
1101:10003:24972
1101:10003:2539
1101:10003:28242
1101:10003:28804
The other is in standard FASTQ format (about 27GB, 280,000,000 lines) like this:
#ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
#ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641:N:0:5
TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifier. For each sequence, this part of Line 1 is unique:
1101:1416:1801 and 1101:10003:75641
And I want to grab Line 1 and the next three lines from the FASTQ file, according to the info file. Here is my code:
import gzip
import re

count = 0
with open('info_path') as info, open('grab_path', 'w') as grab:
    for i in info:
        sample = i.strip()
        with gzip.open('fq_path') as fq:
            for j in fq:
                count += 1
                if count % 4 == 1:
                    line = j.strip()
                    m = re.search(sample, j)
                    if m != None:
                        grab.writelines(line + '\n' + fq.next() + fq.next() + fq.next())
                        count = 0
                        break
And it works, but because both files have millions of lines, it's inefficient (running for one day got only 20,000 lines).
UPDATE (July 6th):
I found that the info file can be read into memory (thanks to @tobias_k for reminding me), so I created a dictionary where the keys are the info lines and the values are all 0. After that, I read the FASTQ file four lines at a time, use the identifier part as the key, and if the key is in the dictionary (value 0), output those 4 lines. Here is my code:
import gzip

dic = {}
with open('info_path') as info:
    for i in info:
        sample = i.strip()
        dic[sample] = 0

with gzip.open('fq_path') as fq, open('grap_path', "w") as grab:
    for j in fq:
        if j[:10] == '#ST-E00126':
            line = j.split(':')
            match = line[4] + ':' + line[5] + ':' + line[6][:-2]
            if dic.get(match) == 0:
                grab.writelines(j + fq.next() + fq.next() + fq.next())
This way is much faster: it takes 20 minutes to get all the matched lines (about 64,000,000 lines). I have also thought about sorting the FASTQ file first with an external sort. Splitting the file into pieces that can be read into memory is OK; my trouble is how to keep the next three lines attached to the identifier line while sorting. Google's answer is to linearize these four lines first, but that alone takes 40 minutes.
Anyway thanks for your help.
You can sort both files by the identifier part (the 1101:1416:1801). Even if the files do not fit into memory, you can use external sorting.
After this, you can apply a simple merge-like strategy: read both files together and do the matching in the meantime. Something like this (pseudocode):
entry1 = readFromFile1()
entry2 = readFromFile2()
while (neither file has ended)
    if (entry1.id == entry2.id)
        record match
        entry2 = readFromFile2()
    else if (entry1.id < entry2.id)
        entry1 = readFromFile1()
    else
        entry2 = readFromFile2()
This way entry1.id and entry2.id stay close to each other and you will not miss any matches. At the same time, this approach needs only one pass over each file.
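A minimal Python sketch of this merge, assuming both inputs are already sorted by identifier. fastq_records() keeps each 4-line record together via the grouper idiom, and extract_id is a hypothetical helper that pulls the 1101:1416:1801 part out of a header line (like the OP's split(':') slicing):
from itertools import izip  # Python 2; use zip on Python 3

def fastq_records(handle):
    # The "grouper" idiom: reusing one iterator four times per record
    # yields the FASTQ file as 4-line tuples, keeping records together.
    args = [iter(handle)] * 4
    return izip(*args)

def merge_match(info_lines, fastq, extract_id, grab):
    # Merge-join two identifier-sorted streams, writing matched records.
    info_iter = iter(info_lines)
    record_iter = fastq_records(fastq)
    info = next(info_iter, None)
    record = next(record_iter, None)
    while info is not None and record is not None:
        rec_id = extract_id(record[0])        # identifier from the header line
        if info == rec_id:
            grab.writelines(record)           # match: emit all four lines
            record = next(record_iter, None)
        elif info < rec_id:
            info = next(info_iter, None)      # info side is behind: advance it
        else:
            record = next(record_iter, None)  # FASTQ side is behind: advance it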
I am working with GraphChi's PageRank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later call into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high-level languages but not with C++ or with binary files. I have found a few things through searching Stack Overflow but no luck yet in reading this file. Ideally this would be done through bash or Python.
Thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the file itself is a simple byte array. Here is an example of how to read it in Python:
import struct
import sys
from array import array as binarray

inputfile = sys.argv[1]
data = open(inputfile, 'rb').read()  # read the file as raw bytes

a = binarray('c')
a.fromstring(data)

s = struct.Struct("f")  # one 4-byte float (native byte order) per vertex
l = len(a)
print "%d bytes" % l

n = l / 4
for i in xrange(0, n):
    x = s.unpack_from(a, i * 4)[0]
    print "%d %f" % (i, x)
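If it helps, a shorter sketch of the same thing lets the array module do the unpacking (same native-byte-order, float-typed assumptions as above):
import sys
from array import array

values = array('f')                  # 4-byte native floats
with open(sys.argv[1], 'rb') as f:
    values.fromstring(f.read())      # Python 2; use frombytes() on Python 3
for i, x in enumerate(values):
    print "%d %f" % (i, x)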
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac/Linux, the following command works to print the 4B.vout data one line per node, with the integer values the same as given in the summary file. If your file is called, e.g., filename.4B.vout, then some command-line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$, = "\n"; print unpack("L*", <>), "";'
Edited to add: this is for the assignments of connected-component ID and community ID, written implicitly: the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1, etc. But I am copy-pasting here, so I'm not sure how it would need to change for floats. It works great for the integer values per node.
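For comparison, a Python sketch equivalent to that perl one-liner for the integer case (assuming native byte order and a file size that is a multiple of 4):
import struct
import sys

with open(sys.argv[1], 'rb') as f:
    data = f.read()
# perl's "L*" unpacks repeated 4-byte unsigned integers; struct's "I"
# is the native 4-byte unsigned int.
count = len(data) // 4
for value in struct.unpack("%dI" % count, data):
    print value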
I am currently making my first Python effort, a modification of some code written by a friend. I am using Python 2.6.6. The original piece of code, which works, extracts information from a log file of credit-card donations to my nonprofit. My new version, should it one day work, will perform the same task for donations made via PayPal. The log files are similar, but they have different field names and other differences.
The error messages I'm getting are:
Traceback (most recent call last):
  File "../logparse-paypal-1.py", line 196, in <module>
    convert_log(sys.argv[1], sys.argv[2], access_ids)
  File "../logparse-paypal-1.py", line 170, in convert_log
    output = [f(record, access_ids) for f in output_fns]
TypeError: 'str' object is not callable
I've read some of the posts on this forum related to this error message, but so far I'm still at sea. I can't find any consequential differences between the portions of my code related to the likely problem object (access_ids) and the code that I started with. All I did with the access_ids table was remove some lines that printed problems the script finds with the table, problems that caused it to ignore some data. Perhaps I changed a character or something while doing that, but I've looked and so far can't find anything.
The portion of the code that is producing these error messages is the following:
# Use the output functions configured above to convert the
# transaction record into a list of outputs to be emitted to
# the CSV output file.
print "Converting %s at %s to CSV" % (record["type"], record["time"])
output = [f(record, access_ids) for f in output_fns]
j = 0
while j < len(output):
    os.write(csv_fd, output[j])
    if j < len(output) - 1:
        os.write(csv_fd, ",")
    else:
        os.write(csv_fd, "\n")
    j += 1
convert_count += 1

print "Converted %d approved transactions to CSV format, skipped %d non-approved transactions" % (convert_count, skip_count)
if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: logparse.py INPUT_FILE OUTPUT_FILE [ACCESS_IDS_FILE]"
        print
        print "  INPUT_FILE       Silent post log containing transaction records (must exist)"
        print "  OUTPUT_FILE      Filename for the CSV file to be created (must not exist, will be created)"
        print "  ACCESS_IDS_FILE  List of Access IDs and email addresses (optional, must exist if specified)"
        sys.exit(-1)

    access_ids = {}
    if len(sys.argv) > 3:
        access_ids = load_access_ids(sys.argv[3])

    convert_log(sys.argv[1], sys.argv[2], access_ids)
Line 170 is this one:
output = [f(record, access_ids) for f in output_fns]
and line 196 is this one:
convert_log(sys.argv[1], sys.argv[2], access_ids)
The access_ids definition, possibly related to the problem, is this:
def access_id_fn(record, access_ids):
    if "payer_email" in record and len(record["payer_email"]) > 0:
        if record["payer_email"] in access_ids:
            return '"' + access_ids[record["payer_email"]] + '"'
        else:
            return ""
    else:
        return ""
AND
def load_access_ids(filename):
    print "Loading Access IDs from %s..." % filename
    access_ids = {}
    for line in open(filename, "r"):
        line = line.rstrip()
        access_id, email = [s.strip() for s in line.split(None, 1)]
        if not email_address.match(email):
            continue
        if email in access_ids:
            access_ids[string.strip(email)] = string.strip(access_id)
    return access_ids
Thanks in advance for any advice with this.
Dave
I'm not seeing anything right off hand, but you did mention that the log files were similar, and I take that to mean there are differences between the two.
Can you post a line from each?
I would double-check the data in the log files and make sure what you think is being read in is correct. It definitely looks to me like a piece of data is being read in but is breaking what the code expects.
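One other thing worth checking: since the TypeError is raised at f(record, access_ids), at least one element of output_fns must be a string rather than a function. A quick, minimal check (assuming output_fns is in scope where the list is built) might look like this:
# Print any entry of output_fns that isn't callable; the repr() shows
# which string slipped into the list of output functions.
for i, f in enumerate(output_fns):
    if not callable(f):
        print "output_fns[%d] is not callable: %r" % (i, f)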
Suppose I have a list in a text file which is as follows -
TaskB_115
TaskB_19
TaskB_105
TaskB_13
TaskB_10
TaskB_0_A_1
TaskB_17
TaskB_114
TaskB_110
TaskB_0_A_5
TaskB_16
TaskB_12
TaskB_113
TaskB_15
TaskB_103
TaskB_2
TaskB_18
TaskB_106
TaskB_11
TaskB_14
TaskB_104
TaskB_112
TaskB_107
TaskB_0_A_4
TaskB_102
TaskB_100
TaskB_109
TaskB_101
TaskB_0_A_2
TaskB_0_A_3
TaskB_116
TaskB_1_A_0
TaskB_111
TaskB_108
If I sort in vim with the command %sort, it gives me output as -
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_10
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_11
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_1_A_0
TaskB_2
But I would like to have the output as follows -
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_3
TaskB_0_A_4
TaskB_0_A_5
TaskB_1_A_0
TaskB_2
TaskB_10
TaskB_11
TaskB_12
TaskB_13
TaskB_14
TaskB_15
TaskB_16
TaskB_17
TaskB_18
TaskB_19
TaskB_100
TaskB_101
TaskB_102
TaskB_103
TaskB_104
TaskB_105
TaskB_106
TaskB_107
TaskB_108
TaskB_109
TaskB_110
TaskB_111
TaskB_112
TaskB_113
TaskB_114
TaskB_115
TaskB_116
Note: I just wrote this list to demonstrate the problem. I could generate this particular list in vim, but I want a sort that works for other things in vim as well.
From vim's :help :sort:
With [n] sorting is done on the first decimal number
in the line (after or inside a {pattern} match).
One leading '-' is included in the number.
Try this command:
sor n
And you don't need the %; :sort operates on all lines if no range is given.
EDIT
As commented by the OP, if you have:
TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_4
TaskB_0_A_3
TaskB_0_A_5
TaskB_1_A_0
you could try:
sor n /.*_\ze\d*/
or
sor nr /\d*$/
EDIT2
For the newly edited question, this command may give you the expected output based on your example data:
sor nr /\d*$/|sor n
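If you ever need the same ordering outside of vim, a natural ("number-aware") sort key is a common approach; here is a minimal Python sketch (the file name tasks.txt is hypothetical, and the key reproduces the ordering in the example data above):
import re

def natural_key(line):
    # Split the line into runs of digits and non-digits so that the
    # numeric runs compare as integers (10 after 2, 100 after 19, ...).
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r'(\d+)', line)]

with open('tasks.txt') as f:
    lines = f.read().splitlines()
lines.sort(key=natural_key)
print '\n'.join(lines)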