Error removing over-represented sequences: TypeError: coercing to Unicode: need string or buffer, NoneType found - bioinformatics

Hi, I am running this Python script to remove over-represented sequences from my fastq files, but I keep getting the error below. I am new to bioinformatics and have been following a fixed pipeline for sequence assembly. I wanted to remove over-represented sequences with this script:
python /home/TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py -1 R1_1.fq -2 R1_2.fq
Here is the error:
```
Traceback (most recent call last):
  File "TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py", line 46, in <module>
    leftseqs=ParseFastqcLog(opts.l_fastqc)
  File "TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py", line 33, in ParseFastqcLog
    with open(fastqclog) as fp:
TypeError: coercing to Unicode: need string or buffer, NoneType found
```
Here is the script:
```
import sys
import gzip
from os.path import basename
import argparse
import re
from itertools import izip, izip_longest

def seqsmatch(overreplist, read):
    flag = False
    if overreplist != []:
        for seq in overreplist:
            if seq in read:
                flag = True
                break
    return flag

def get_input_streams(r1file, r2file):
    if r1file[-2:] == 'gz':
        r1handle = gzip.open(r1file, 'rb')
        r2handle = gzip.open(r2file, 'rb')
    else:
        r1handle = open(r1file, 'r')
        r2handle = open(r2file, 'r')
    return r1handle, r2handle

def FastqIterate(iterable, fillvalue=None):
    "Grab one 4-line fastq read at a time"
    args = [iter(iterable)] * 4
    return izip_longest(fillvalue=fillvalue, *args)

def ParseFastqcLog(fastqclog):
    with open(fastqclog) as fp:
        for result in re.findall('Overrepresented sequences(.*?)END_MODULE', fp.read(), re.S):
            seqs = [i.split('\t')[0] for i in result.split('\n')[2:-1]]
    return seqs

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="options for removing reads with over-represented sequences")
    parser.add_argument('-1', '--left_reads', dest='leftreads', type=str, help='R1 fastq file')
    parser.add_argument('-2', '--right_reads', dest='rightreads', type=str, help='R2 fastq file')
    parser.add_argument('-fql', '--fastqc_left', dest='l_fastqc', type=str, help='fastqc text file for R1')
    parser.add_argument('-fqr', '--fastqc_right', dest='r_fastqc', type=str, help='fastqc text file for R2')
    opts = parser.parse_args()

    leftseqs = ParseFastqcLog(opts.l_fastqc)
    rightseqs = ParseFastqcLog(opts.r_fastqc)

    r1_out = open('rmoverrep_' + basename(opts.leftreads).replace('.gz', ''), 'w')
    r2_out = open('rmoverrep_' + basename(opts.rightreads).replace('.gz', ''), 'w')
    r1_stream, r2_stream = get_input_streams(opts.leftreads, opts.rightreads)

    counter = 0
    failcounter = 0
    with r1_stream as f1, r2_stream as f2:
        R1 = FastqIterate(f1)
        R2 = FastqIterate(f2)
        for entry in R1:
            counter += 1
            if counter % 100000 == 0:
                print "%s reads processed" % counter
            head1, seq1, placeholder1, qual1 = [i.strip() for i in entry]
            head2, seq2, placeholder2, qual2 = [j.strip() for j in R2.next()]
            flagleft, flagright = seqsmatch(leftseqs, seq1), seqsmatch(rightseqs, seq2)
            if True not in (flagleft, flagright):
                r1_out.write('%s\n' % '\n'.join([head1, seq1, '+', qual1]))
                r2_out.write('%s\n' % '\n'.join([head2, seq2, '+', qual2]))
            else:
                failcounter += 1

    print 'total # of reads evaluated = %s' % counter
    print 'number of reads retained = %s' % (counter - failcounter)
    print 'number of PE reads filtered = %s' % failcounter
    r1_out.close()
    r2_out.close()
```

Maybe you have already solved it; I had the same error, but it is running well now.
Hope this helps.
(1) Files we need:
```
usage: RemoveFastqcOverrepSequenceReads.py [-h] [-1 LEFTREADS] [-2 RIGHTREADS] [-fql L_FASTQC] [-fqr R_FASTQC]
```
(2) Specify the fastqc_data.txt files from the FastQC output (unzip the FastQC output directory to find them):
```
'-fql','--fastqc_left',dest='l_fastqc',type=str,help='fastqc text file for R1'
'-fqr','--fastqc_right',dest='r_fastqc',type=str,help='fastqc text file for R2'
```
(3) Keep the reads and the fastqc_data.txt files in the same directory.
(4) Specify the path before each file:
```
python RemoveFastqcOverrepSequenceReads.py \
  -1 ./bicho.fq.1.gz -2 ./bicho.fq.2.gz \
  -fql ./fastqc_data_bicho_1.txt -fqr ./fastqc_data_bicho_2.txt
```
(5) Run! :)
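For context on why the original call failed: argparse gives any option that is not supplied a default of None, so running the script with only -1 and -2 leaves opts.l_fastqc as None, and open(None) inside ParseFastqcLog raises exactly this TypeError under Python 2. As a small hardening sketch (my own suggestion, not part of the original script), the FastQC arguments could be declared required so argparse exits with a clear usage message instead:
```
import argparse

parser = argparse.ArgumentParser(
    description="options for removing reads with over-represented sequences")
# required=True makes argparse exit with a usage error when the FastQC
# reports are omitted, instead of handing ParseFastqcLog a None.
parser.add_argument('-fql', '--fastqc_left', dest='l_fastqc', type=str,
                    required=True, help='fastqc text file for R1')
parser.add_argument('-fqr', '--fastqc_right', dest='r_fastqc', type=str,
                    required=True, help='fastqc text file for R2')
opts = parser.parse_args()
```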

Related

Trying to traverse (walk) the directory tree of an ESP32 using MicroPython

```
# try.py
import uos

dir = 16384

def walk(t):  # recursive function
    print('-', t)
    w = uos.ilistdir(t)
    for x in w:
        L = list(x)
        print(L[0], L[1], L[3])
        if L[1] == dir:
            walk(L[0])
        else:
            return

z = uos.ilistdir()
for x in z:
    L = list(x)
    print(L[0], L[1], L[3])
    if L[1] == dir:
        walk(L[0])
```
The code stops with an error on line 7:
Output:
```
Traverse.py 32768 773
boot.py 32768 139
lib 16384 0
-lib
one 16384 0
-one
Traceback (most recent call last):
  File "<stdin>", line 21, in <module>
  File "<stdin>", line 12, in walk
  File "<stdin>", line 7, in walk
OSError: [Errno 2] ENOENT
```
The directory structure is:
```
lib
one
  two
    three
      three.py
boot.py
main.py
one.py
Traverse.py
```
It seems that it stops on a directory that has no files in it.
Don't have an ESP to test, but there are some problems here:
- you shouldn't return if the entry is a file but instead continue; this is why it stops too soon
- you should skip the current and parent directory to avoid infinite recursion
- when recursing you have to prepend the top directory; that is probably the reason for the error, i.e. your code calls walk('two') but there is no such directory, it has to be one/two
- you can use the walk function on the current directory, so that last bit where you duplicate the implementation isn't needed.
Additionally:
- iterating ilistdir returns tuples, which can be indexed as well, so there's no need to convert each one into a list
- passing a collection to print directly also works, so there's no need for a separate print(x[0], x[1], ...)
Adaptation, with slightly different printing of full paths so it's easier to follow:
```
import uos

dir_code = 16384

def walk(t):
    print('-', t)
    for x in uos.ilistdir(t):
        print(x)
        if x[1] == dir_code and x[0] != '.' and x[0] != '..':
            walk(t + '/' + x[0])

walk('.')
```
This will still print directories twice, and all that indexing makes things hard to read. Adaptation with tuple unpacking and printing directories only once:
```
import uos

dir_code = 16384

def walk(top):
    print(top)
    # ilistdir may yield 3- or 4-tuples (file entries also carry a size
    # field, as the question's output shows), so soak up any extras with *_
    for name, code, *_ in uos.ilistdir(top):
        if code != dir_code:
            print(top + '/' + name)
        elif name not in ('.', '..'):
            walk(top + '/' + name)

walk('.')
```

Analyzing protein sequences with the ProtParam module

I'm fairly new to Biopython. Right now, I'm trying to compute protein parameters for several protein sequences (more than 100) in FASTA format. However, I've found it difficult to parse the sequences correctly.
This is the code I'm using:
```
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

input_file = open("/Users/matias/Documents/Python/DOE.fasta", "r")

for record in SeqIO.parse(input_file, "fasta"):
    my_seq = str(record.seq)
    analyse = ProteinAnalysis(my_seq)
    print(analyse.molecular_weight())
```
But I'm getting this error message:
```
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in molecular_weight
    weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in <genexpr>
    weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
KeyError: '\\'
```
Printing each sequence as a string shows me that every sequence has a "\" at the end, but so far I haven't been able to remove it. Any ideas would be much appreciated.
That really shouldn't be there in your file, but if you can't get a clean input file, you can use my_seq = str(record.seq).rstrip('\\') to remove it at runtime.
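Putting that workaround together with the original loop, a minimal sketch (assuming the stray backslash is the only artifact in the records):
```
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

with open("/Users/matias/Documents/Python/DOE.fasta") as input_file:
    for record in SeqIO.parse(input_file, "fasta"):
        # strip the trailing backslash before handing the sequence to ProtParam
        my_seq = str(record.seq).rstrip('\\')
        analyse = ProteinAnalysis(my_seq)
        print(analyse.molecular_weight())
```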

When exporting XLIFF from Xcode, how to exclude dummy strings?

I'm using Xcode's Editor > Export For Localization... to export an XLIFF file for translation, but the export for Main.storyboard includes a lot of unnecessary strings, mostly placeholders/dummies that are only useful at design time.
How do I exclude such strings from the XLIFF file?
I've written a script that excludes certain translations.
How it works:
cmd-line: python strip_loc.py input.xliff output.xliff exclude_list.txt [-v]
Example usage: python strip_loc.py en.xliff en-stripped.xliff exclude_words.txt -v
The exclude_list.txt is a file with one string per line. The script parses this list and builds a list of banned words. If a translation whose source contains one of these strings is encountered, the whole translation unit is removed from the output xml/xliff.
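For illustration, an exclude_list.txt might look like this (hypothetical contents; one banned string per line, matched case-insensitively against each <source> element):
```
Lorem ipsum
Label
dummy text
```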
Here is the solution, which works with the latest Python version:
```
import argparse
import xml.etree.ElementTree as ET

parser = argparse.ArgumentParser(description="Process xliff file against banned words and output new xliff with stripped translation.",
                                 epilog="Example usage: strip_loc.py en.xliff en-stripped.xliff exclude_words.txt -v")
parser.add_argument('source', help="Input .xliff file containing all the strings")
parser.add_argument('output', help="Output .xliff file which will contain the stripped strings according to the exclude_list")
parser.add_argument('exclude_list', help="Multi-line text file where every line is a banned string")
parser.add_argument('-v', '--verbose', action="store_true", help="print script steps while working")
args = parser.parse_args()

def log(string_to_log):
    if args.verbose:
        print(string_to_log)

banned_words = [line.strip().lower() for line in open(args.exclude_list, 'r')]
log("original file: " + args.source)
log("output file: " + args.output)
log("banned words: " + ", ".join(banned_words))
log("")

ET.register_namespace('', "urn:oasis:names:tc:xliff:document:1.2")
ns = {"n": "urn:oasis:names:tc:xliff:document:1.2"}

with open(args.source, 'r') as xml_file:
    tree = ET.parse(xml_file)
root = tree.getroot()

counter = 1
for file_body in root.findall("./*/n:body", ns):
    # findall() returns a list, so removing trans-units while looping is safe
    for trans_unit in file_body.findall("n:trans-unit", ns):
        source = trans_unit.find("n:source", ns)
        if source.text is not None:
            source = source.text.encode("utf-8").lower()
            source = source.decode("utf-8")
            source = source.strip()
            for banned_word in banned_words:
                if source.find(banned_word) != -1:
                    log(str(counter) + ": removing <trans-unit id=\"" + trans_unit.attrib['id'] + "\">, banned: \"" + banned_word + "\"")
                    file_body.remove(trans_unit)
                    break
        counter += 1

tree.write(args.output, "utf-8", True)
log("")
print("DONE")
```
And the usage is the same:
python strip_loc.py en.xliff en-stripped.xliff exclude_words.txt -v
For me, I use the XLIFF Online Editor to edit xliff files. It makes it easy to ignore the dummy text or anything else you need to skip.

error in writing to a file

I have written a Python script that calls Unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:
```
import subprocess

sort_bt = open("sort_blast.txt", 'w+')
sort_file_cmd = "sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd, stdout=sort_bt, shell=True)
```
The output file, however, contains an incomplete line, which produces an error when I parse the table; yet when I check the corresponding entry in the input file given to sort, the line looks perfect. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In the output file, however, only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly, considering that Python has a built-in method for sorting items.
Looking at your sample data, it appears to be structured data with a | delimiter. Here's how you could open that file and iterate over the results in Python in a sorted manner:
```
def custom_sorter(first, second):
    """ A Custom Sort function which compares items
    based on the value in the 2nd and 6th columns. """
    # First, we break the line into a list
    first_items, second_items = first.split(u'|'), second.split(u'|')  # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else:  # They are the same
            return 0  # Order doesn't matter then
    else:
        return 0

with open(src_file_path, 'r') as src_file:
    data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!

with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    for line in sorted(data.splitlines(), cmp=custom_sorter):  # Sort the data on the fly
        dst_sorted_file.write(line + '\n')  # Write the line (splitlines() stripped the newlines)
```
FYI, this code may need some jiggling. I didn't test it too well.
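One caveat worth adding: the cmp= keyword of sorted() exists only in Python 2. Under Python 3 the same comparator can be reused via functools.cmp_to_key; a sketch, reusing custom_sorter and data from the block above:
```
from functools import cmp_to_key

# Python 3: sorted() no longer accepts cmp=, so wrap the comparator.
sorted_lines = sorted(data.splitlines(), key=cmp_to_key(custom_sorter))
```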
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate: sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
```
from subprocess import check_call

with open("sort_blast.txt", 'wb') as output_file:
    check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
```
You can write it in pure Python, e.g., for a small input file:
```
def custom_key(line):
    fields = line.split()  # split line on any whitespace
    return fields[1], float(fields[5])  # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
    L = input_file.read().splitlines()  # read from the input file
    L.sort(key=custom_key)  # sort it
    output_file.write("\n".join(L))  # write to the output file
```
If you need to sort a file that does not fit in memory, see Sorting text file by using Python.
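The idea behind sorting a too-big file, roughly: sort the file in chunks that do fit in memory, spill each sorted chunk to a temporary file, then lazily merge the chunks. A rough sketch of that approach (my own illustration, not from the linked answer; needs Python 3.5+ for the key= parameter of heapq.merge):
```
import heapq
import tempfile
from itertools import islice

def custom_key(line):
    fields = line.split()
    return fields[1], float(fields[5])

def external_sort(in_path, out_path, chunk_lines=100000):
    chunks = []
    with open(in_path) as f:
        while True:
            # read the next chunk; normalize line endings so merged
            # lines never run together
            lines = [l if l.endswith('\n') else l + '\n'
                     for l in islice(f, chunk_lines)]
            if not lines:
                break
            lines.sort(key=custom_key)  # in-memory sort of this chunk
            tmp = tempfile.TemporaryFile('w+')
            tmp.writelines(lines)
            tmp.seek(0)
            chunks.append(tmp)
    with open(out_path, 'w') as out:
        # heapq.merge consumes the sorted chunk files lazily
        out.writelines(heapq.merge(*chunks, key=custom_key))
```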

mysterious "'str' object is not callable" python error

I am currently making my first Python effort, a modification of some code written by a friend. I am using Python 2.6.6. The original piece of code, which works, extracts information from a log file of donations made by credit card to my nonprofit. My new version, should it one day work, will perform the same task for donations made via PayPal. The log files are similar, but have different field names and other differences.
The error messages I'm getting are:
```
Traceback (most recent call last):
  File "../logparse-paypal-1.py", line 196, in <module>
    convert_log(sys.argv[1], sys.argv[2], access_ids)
  File "../logparse-paypal-1.py", line 170, in convert_log
    output = [f(record, access_ids) for f in output_fns]
TypeError: 'str' object is not callable
```
I've read some of the posts on this forum related to this error message, but so far I'm still at sea. I can't find any consequential differences between the portions of my code related to the likely problem object (access_ids) and the code that I started with. All I did related to the access_ids table was to remove some lines that printed problems the script finds with the table, which caused it to ignore some data. Perhaps I changed a character or something while doing that, but I've looked and so far can't find anything.
The portion of the code that is producing these error messages is the following:
```
        # Use the output functions configured above to convert the
        # transaction record into a list of outputs to be emitted to
        # the CSV output file.
        print "Converting %s at %s to CSV" % (record["type"], record["time"])
        output = [f(record, access_ids) for f in output_fns]
        j = 0
        while j < len(output):
            os.write(csv_fd, output[j])
            if j < len(output) - 1:
                os.write(csv_fd, ",")
            else:
                os.write(csv_fd, "\n")
            j += 1
        convert_count += 1

    print "Converted %d approved transactions to CSV format, skipped %d non-approved transactions" % (convert_count, skip_count)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: logparse.py INPUT_FILE OUTPUT_FILE [ACCESS_IDS_FILE]"
        print
        print "  INPUT_FILE       Silent post log containing transaction records (must exist)"
        print "  OUTPUT_FILE      Filename for the CSV file to be created (must not exist, will be created)"
        print "  ACCESS_IDS_FILE  List of Access IDs and email addresses (optional, must exist if specified)"
        sys.exit(-1)

    access_ids = {}
    if len(sys.argv) > 3:
        access_ids = load_access_ids(sys.argv[3])

    convert_log(sys.argv[1], sys.argv[2], access_ids)
```
if __name__ == '__main__':
if len(sys.argv) < 3:
print "Usage: logparse.py INPUT_FILE OUTPUT_FILE [ACCESS_IDS_FILE]"
print
print " INPUT_FILE Silent post log containing transaction records (must exist)"
print " OUTPUT_FILE Filename for the CSV file to be created (must not exist, will be created)"
print " ACCESS_IDS_FILE List of Access IDs and email addresses (optional, must exist if specified)"
sys.exit(-1)
access_ids = {}
if len(sys.argv) > 3:
access_ids = load_access_ids(sys.argv[3])
convert_log(sys.argv[1], sys.argv[2], access_ids)
Line 170 is this one:
output = [f(record, access_ids) for f in output_fns]
and line 196 is this one:
convert_log(sys.argv[1], sys.argv[2], access_ids)
The access_ids definition, possibly related to the problem, is this:
```
def access_id_fn(record, access_ids):
    if "payer_email" in record and len(record["payer_email"]) > 0:
        if record["payer_email"] in access_ids:
            return '"' + access_ids[record["payer_email"]] + '"'
        else:
            return ""
    else:
        return ""
```
AND
```
def load_access_ids(filename):
    print "Loading Access IDs from %s..." % filename
    access_ids = {}
    for line in open(filename, "r"):
        line = line.rstrip()
        access_id, email = [s.strip() for s in line.split(None, 1)]
        if not email_address.match(email):
            continue
        if email in access_ids:
            access_ids[string.strip(email)] = string.strip(access_id)
    return access_ids
```
Thanks in advance for any advice with this.
Dave
I'm not seeing anything right off hand, but you did mention that the log files were similar and I take that to mean that there are differences between the two.
Can you post a line from each?
I would double-check the data in the log files and make sure that what you think is being read in is correct. It definitely looks to me as though a piece of data is being read in somewhere that breaks what the code is expecting.
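For what it's worth, the traceback itself pins down the shape of the problem: output_fns must contain a string where a function is expected, so calling f(record, access_ids) blows up. A minimal reproduction (all names here are hypothetical, just to show the failure mode):
```
def format_time(record, access_ids):
    return record["time"]

# One entry is a string (e.g. a function's name quoted by mistake)
# rather than the function object itself.
output_fns = [format_time, "format_time"]

record = {"time": "2011-01-01 12:00:00"}
# Raises TypeError: 'str' object is not callable on the second entry.
output = [f(record, {}) for f in output_fns]
```
So it's worth checking wherever output_fns is built: one of its entries is probably being added as a quoted name instead of the function object.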
