mysterious "'str' object is not callable" python error - python-2.6

I am currently making my first python effort, a modification of some code written by a friend. I am using python 2.6.6. The original piece of code, which works, extracts information from a log file of data from donations made by credit card to my nonprofit. My new version, should it one day work, will perform the same task for donations that were made by paypal. The log files are similar, but have different field names and other differences.
The error messages I'm getting are:
Traceback (most recent call last):
File "../logparse-paypal-1.py", line 196, in
convert_log(sys.argv[1], sys.argv[2], access_ids)
File "../logparse-paypal-1.py", line 170, in convert_log
output = [f(record, access_ids) for f in output_fns]
TypeError: 'str' object is not callable
I've read some of the posts on this forum related to this error message, but so far I'm still at sea. I can't find any consequential differences between the portions of my code that related to the likely problem object (access_ids) and the code that I started with. All I did related to the access_ids table was to remove some lines that printed problems the script finds with the table that caused it to ignore some data. Perhaps I changed a character or something while doing that, but I've looked and so far can't find anything.
The portion of the code that is producing these error messages is the following:
# Use the output functions configured above to convert the
# transaction record into a list of outputs to be emitted to
# the CSV output file.
print "Converting %s at %s to CSV" % (record["type"], record["time"])
output = [f(record, access_ids) for f in output_fns]
j = 0
while j < len(output):
os.write(csv_fd, output[j])
if j < len(output) - 1:
os.write(csv_fd, ",")
else:
os.write(csv_fd, "\n")
j += 1
convert_count += 1
print "Converted %d approved transactions to CSV format, skipped %d non-approved transactions" % (convert_count, skip_count)
if __name__ == '__main__':
if len(sys.argv) < 3:
print "Usage: logparse.py INPUT_FILE OUTPUT_FILE [ACCESS_IDS_FILE]"
print
print " INPUT_FILE Silent post log containing transaction records (must exist)"
print " OUTPUT_FILE Filename for the CSV file to be created (must not exist, will be created)"
print " ACCESS_IDS_FILE List of Access IDs and email addresses (optional, must exist if specified)"
sys.exit(-1)
access_ids = {}
if len(sys.argv) > 3:
access_ids = load_access_ids(sys.argv[3])
convert_log(sys.argv[1], sys.argv[2], access_ids)
Line 170 is this one:
output = [f(record, access_ids) for f in output_fns]
and line 196 is this one:
convert_log(sys.argv[1], sys.argv[2], access_ids)
The access_ids definition, possibly related to the problem, is this:
def access_id_fn(record, access_ids):
if "payer_email" in record and len(record["payer_email"]) > 0:
if record["payer_email"] in access_ids:
return '"' + access_ids[record["payer_email"]] + '"'
else:
return ""
else:
return ""
AND
def load_access_ids(filename):
print "Loading Access IDs from %s..." % filename
access_ids = {}
for line in open(filename, "r"):
line = line.rstrip()
access_id, email = [s.strip() for s in line.split(None, 1)]
if not email_address.match(email):
continue
if email in access_ids:
access_ids[string.strip(email)] = string.strip(access_id)
return access_ids
Thanks in advance for any advice with this.
Dave

I'm not seeing anything right off hand, but you did mention that the log files were similar and I take that to mean that there are differences between the two.
Can you post a line from each?
I would double check the data in the log files and make sure what you think is being read in is correct. This definitely appears to me like a piece of data is being read in, but somewhere it is breaking what the code is expecting.

Related

How can I use the ruamel.yaml rtsc mode?

I've been working on creating a YAML re-formatter based on ruamel.yaml (which you can see here).
I'm currently using version 0.17.20.
Cleaning up comments and whitespace has been difficult. I want to:
ensure there is only one space before the # for EOL comments
align full line comments with the key or item immediately following
remove duplicate blank lines so there is at most one blank line
To get closer to achieving that, I have a custom Emitter class where I extend write_comment to adjust the comments just before writing with super().write_comment(...). However, the Emitter does not know about which key or item comes next because comments are generally attached as post comments.
As I've studied the ruamel.yaml code to figure out how to do this, I found the rtsc mode (Round Trip Split Comments) which looks fantastic because it separates EOLComment, BlankLineComment and FullLineComment instead of lumping them together.
From what I can tell, the Parser and Scanner have been adjusted to capture the comments. So, loading is (mostly?) implemented with this "NEWCMNT" implementation. But Emitter.write_comment expects CommentToken instead of comment line numbers, so dumping does not work yet.
If I update my Emitter.write_comment method, is that enough to finish dumping? Or what else might be necessary? In one of my tries, I ran into a sys.exit in ScannedComments.assign_eol() - what else is needed to finish that?
PS: I wouldn't normally ask how to collaborate on StackOverflow, but this is not a bug report or a feature request, and I'm trying/failing to use a new (undocumented) feature, so I'm filing this here instead of sourceforge.
rtsc is work in progress cq work started but unfinished. It's internals will almost certainly change.
Two of the three points you indicate can relatively easy be implemented:
set the column of each comment to 0 ( by recursively going over a loaded data structure similar to here ) if the column is before the position of the end of the value on a line, you'll get one space between the value and the column
at the same time doing the recursion in the previous point. Take each comment value and do something like:
value = '\n'.join(line.strip() for line in value.splitlines()
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
The indentation to the following element is difficult, depends on the
data structure etc. Given that these are full line comments, I suggest
you do some postprocessing of the YAML document you generate:
find a full line comment, gather full line comments until next line is
not full line comment (i.e. some "real" YAML). Since full line comments
are in column[0] if the previous stuff is applied, you don't have to
track if you are in a (multi-line) literal or folded scalar string where
one of the lines happens to start with #
determine number of spaces
before real YAML and apply these to the full line comments.
import sys
import ruamel.yaml
yaml_str = """\
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate
"""
def redo_comments(d):
def do_one(comment):
if not comment:
return
comment.column = 0
value = '\n'.join(line.strip() for line in comment.value.splitlines()) + '\n'
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
comment.value = value
def do_values(v):
for x in v:
for comment in x:
do_one(comment)
def do_loc(v):
if v is None:
return
do_one(v[0])
if not v[1]:
return
for comment in v[1]:
do_one(comment)
if isinstance(d, dict):
do_loc(d.ca.comment)
do_values(d.ca.items.values())
for val in d.values():
redo_comments(val)
elif isinstance(d, list):
do_values(d.ca.items.values())
for elem in d:
redo_comments(elem)
def realign_full_line_comments(s):
res = []
buf = []
for line in s.splitlines(True):
if not buf:
if line and line[0] == '#':
buf.append(line)
else:
res.append(line)
else:
if line[0] in '#\n':
buf.append(line)
else:
# YAML line, determine indent
count = 0
while line[count] == ' ':
count += 1
if count > len(line):
break # superfluous?
indent = ' ' * count
for cline in buf:
if cline[0] == '\n': # empty
res.append(cline)
else:
res.append(indent + cline)
buf = []
res.append(line)
return ''.join(res)
yaml = ruamel.yaml.YAML()
# yaml.indent(mapping=4, sequence=4, offset=2)
# yaml.preserve_quotes = True
data = yaml.load(yaml_str)
redo_comments(data)
yaml.dump(data, sys.stdout, transform=realign_full_line_comments)
which gives:
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate

Analyzing protein sequences with the ProtParam module

I'm fairly new with Biopython. Right now, I'm trying to compute protein parameters from several protein sequences (more than 100) in fasta format. However, I've found difficult to parse the sequences correctly.
This is the code im using:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis
input_file = open ("/Users/matias/Documents/Python/DOE.fasta", "r")
for record in SeqIO.parse(input_file, "fasta"):
my_seq = str(record.seq)
analyse = ProteinAnalysis(my_seq)
print(analyse.molecular_weight())
But I'm getting this error message:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site- packages/Bio/SeqUtils/__init__.py", line 438, in molecular_weight
weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in <genexpr>
weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
KeyError: '\\'
Printing each sequence as string shows me every seq has a "\" at the end, but so far I haven't been able to remove it. Any ideas would be very appreciated.
That really shouldn't be there in your file, but if you can't get a clean input file, you can use my_seq = str(record.seq).rstrip('\\') to remove it at runtime.

python regex specific blocks of text from large text file

I'm new to python and this site so thank-you in advance for your... understanding. This is my first attempt at a python script.
I'm having what I think is a performance issue trying to solve this problem which is causing me to not get any data back.
This code works on a small text file of a couple pages but when I try to use it on my 35MB real data text file it just hits the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re
with open('testdata.txt') as myfile:
content = myfile.read()
text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with some time-line string. Try this code:
import re
start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')
capturing = 0
capture = ''
with open('output.txt', 'w') as output_file:
with open('input.txt', 'r') as input_file:
for line in input_file:
if capturing > 0 and capturing <= 4:
capturing += 1
capture += line
elif start_pattern.match(line):
capturing = 1
capture = line
if capturing == 4:
if end_pattern.match(line):
output_file.write(capture + '\n')
else:
capturing = 0
It iterates over the input file, line by line. If it finds a line matching the start_pattern, it will read in 3 more. If the 4th line matches the end_pattern, it will write the whole group to the output file.

Python: Can I grab the specific lines from a large file faster?

I have two large files. One of them is an info file(about 270MB and 16,000,000 lines) like this:
1101:10003:17729
1101:10003:19979
1101:10003:23319
1101:10003:24972
1101:10003:2539
1101:10003:28242
1101:10003:28804
The other is a standard FASTQ format(about 27G and 280,000,000 lines) like this:
#ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
#ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641:N:0:5
TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifie. For each sequence,this part of the Line 1 is unique.
1101:1416:1801 and 1101:10003:75641
And I want to grab the Line 1 and the next three lines from the FASTQ file according to the info file. Here is my code:
import gzip
import re
count = 0
with open('info_path') as info, open('grab_path','w') as grab:
for i in info:
sample = i.strip()
with gzip.open('fq_path') as fq:
for j in fq:
count += 1
if count%4 == 1:
line = j.strip()
m = re.search(sample,j)
if m != None:
grab.writelines(line+'\n'+fq.next()+fq.next()+fq.next())
count = 0
break
And it works, but because both of these two files have millions of lines, it's inefficient(running one day only get 20,000 lines).
UPDATE at July 6th:
I find that the info file can be read into the memory(thank #tobias_k for reminding me), so I creat a dictionary that the keys are info lines and the values are all 0. After that, I read the FASTQ file every 4 line, use the identifier part as the key,if the value is 0 then return the 4 lines. Here is my code:
import gzip
dic = {}
with open('info_path') as info:
for i in info:
sample = i.strip()
dic[sample] = 0
with gzip.open('fq_path') as fq, open('grap_path',"w") as grab:
for j in fq:
if j[:10] == '#ST-E00126':
line = j.split(':')
match = line[4] +':'+line[5]+':'+line[6][:-2]
if dic.get(match) == 0:
grab.writelines(j+fq.next()+fq.next()+fq.next())
This way is much faster, it takes 20mins to get all the matched lines(about 64,000,000 lines). And I have thought about sorting the FASTQ file first by external sort. Splitting the file that can be read into the memory is ok, my trouble is how to keep the next three lines following the indentifier line while sorting. The Google's answer is to linear these four lines first, but it will take 40mins to do so.
Anyway thanks for your help.
You can sort both files by the identifier (the 1101:1416:1801) part. Even if files do not fit into memory, you can use external sorting.
After this, you can apply a simple merge-like strategy: read both files together and do the matching in the meantime. Something like this (pseudocode):
entry1 = readFromFile1()
entry2 = readFromFile2()
while (none of the files ended)
if (entry1.id == entry2.id)
record match
else if (entry1.id < entry2.id)
entry1 = readFromFile1()
else
entry2 = readFromFile2()
This way entry1.id and entry2.id are always close to each other and you will not miss any matches. At the same time, this approach requires iterating over each file once.

error in writing to a file

I have written a python script that calls unix sort using subprocess module. I am trying to sort a table based on two columns(2 and 6). Here is what I have done
sort_bt=open("sort_blast.txt",'w+')
sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)
The output file however contains an incomplete line which produces an error when I parse the table but when I checked the entry in the input file given to sort the line looks perfect. I guess there is some problem when sort tries to write the result to the file specified but I am not sure how to solve it though.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In output file however only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly considering that python has a built in method for sorting items.
Looking at your sample data, it appears to be structured data, with a | delimiter. Here's how you could open that file, and iterate over the results in python in a sorted manner:
def custom_sorter(first, second):
""" A Custom Sort function which compares items
based on the value in the 2nd and 6th columns. """
# First, we break the line into a list
first_items, second_items = first.split(u'|'), second.split(u'|') # Split on the pipe character.
if len(first_items) >= 6 and len(second_items) >= 6:
# We have enough items to compare
if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
return 1
elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
return -1
else: # They are the same
return 0 # Order doesn't matter then
else:
return 0
with open(src_file_path, 'r') as src_file:
data = src_file.read() # Read in the src file all at once. Hope the file isn't too big!
with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
for line in sorted(data.splitlines(), cmp = custom_sorter): # Sort the data on the fly
dst_sorted_file.write(line) # Write the line to the dst_file.
FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate: sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call
with open("sort_blast.txt",'wb') as output_file:
check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can write it in pure Python e.g., for a small input file:
def custom_key(line):
fields = line.split() # split line on any whitespace
return fields[1], float(fields[5]) # Python uses zero-based indexing
with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
L = input_file.read().splitlines() # read from the input file
L.sort(key=custom_key) # sort it
output_file.write("\n".join(L)) # write to the output file
If you need to sort a file that does not fit in memory; see Sorting text file by using Python

Resources