What is the best practice to skip non-ASCII characters in mixed-encoded text in Python 3? - elasticsearch

I was able to import a text file into an Elasticsearch index on my local machine.
Despite using a virtual environment, the production machine is a nightmare, because I keep getting errors like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)
I am using Python 3, and I personally had fewer issues in Python 2; maybe it is just the frustration of a couple of wasted hours.
I can't understand why I am not able to strip or handle non-ASCII chars.
I tried to import:
from unidecode import unidecode

def remove_non_ascii(text):
    return unidecode(unicode(text, encoding="utf-8"))
using Python 2, with no success.
Back on Python 3:
import string

printable = set(string.printable)
''.join(filter(lambda x: x in printable, 'mixed non ascii string'))
no success
import codecs
with codecs.open(path, encoding='utf8') as f:
    ...
no success
I also tried:
# -*- coding: utf-8 -*-
no success
https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
no success ...
None of the above seems able to strip or handle the non-ASCII characters; it is very cumbersome, and I keep getting the following errors:
with open(path) as f:
    for line in f:
        line = line.replace('\n', '')
        el = line.split('\t')
        print(el)
        _id = el[0]
        _source = el[1]
        _name = el[2]
        # _description = ''.join(filter(lambda x: x in printable, el[-1]))
        #
        _description = remove_non_ascii(el[-1])
        print(_id, _source, _name, _description, setTipe(_source))
        action = {
            "_index": _indexName,
            "_type": setTipe(_source),
            "_id": _source,
            "_source": {
                "name": _name,
                "description": _description
            }
        }
        helpers.bulk(es, [action])
File "<stdin>", line 22, in <module>
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 194, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)
I would like a "definitive" practice for handling encoding problems in Python 3; I am using the same scripts on different machines and getting different results...

ASCII covers code points 0-127, so you can filter on ord:
def remove_non_ascii(text):
    ascii_characters = ""
    for character in text:
        if ord(character) <= 127:
            ascii_characters += character
    return ascii_characters
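A shorter route in Python 3, if a lossy strip is acceptable, is to round-trip through the ascii codec; opening the file with an explicit encoding and an errors policy also avoids the UnicodeDecodeError at the source. A minimal sketch, assuming the input file is meant to be UTF-8:
def remove_non_ascii(text):
    # Encode to ASCII, silently dropping anything above code point 127.
    return text.encode('ascii', 'ignore').decode('ascii')

# Open with an explicit encoding rather than the machine's locale default;
# errors='replace' substitutes undecodable bytes instead of raising.
with open(path, encoding='utf-8', errors='replace') as f:
    for line in f:
        clean = remove_non_ascii(line.rstrip('\n'))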

Related

How to run a Transformers BERT model without the pipeline?

I have found myself dealing with an environment that does not support multiprocessing. How do I run my DistilBERT model without the transformers pipeline?
Here is the code right now:
import json
import os
import sys
sys.path.append("/mnt/access")
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers.pipelines import pipeline

def lambda_handler(event, context):
    print("After:", os.listdir("/mnt/access"))
    tokenizer = AutoTokenizer.from_pretrained('/mnt/access/Dis_Save/')
    model = AutoModelForQuestionAnswering.from_pretrained('/mnt/access/Dis_Save/')
    nlp_qa = pipeline('question-answering', tokenizer=tokenizer, model=model)
    context = "tra"
    question = "tra"
    X = nlp_qa(context=context, question=question)
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
Error message I get right now:
{
"errorMessage": "[Errno 38] Function not implemented",
"errorType": "OSError",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 18, in lambda_handler\n X = nlp_qa(context=context, question=question)\n",
" File \"/mnt/access/transformers/pipelines.py\", line 1776, in __call__\n features_list = [\n",
" File \"/mnt/access/transformers/pipelines.py\", line 1777, in <listcomp>\n squad_convert_examples_to_features(\n",
" File \"/mnt/access/transformers/data/processors/squad.py\", line 354, in squad_convert_examples_to_features\n with Pool(threads, initializer=squad_convert_example_to_features_init, initargs=(tokenizer,)) as p:\n",
" File \"/var/lang/lib/python3.8/multiprocessing/context.py\", line 119, in Pool\n return Pool(processes, initializer, initargs, maxtasksperchild,\n",
" File \"/var/lang/lib/python3.8/multiprocessing/pool.py\", line 191, in __init__\n self._setup_queues()\n",
" File \"/var/lang/lib/python3.8/multiprocessing/pool.py\", line 343, in _setup_queues\n self._inqueue = self._ctx.SimpleQueue()\n",
" File \"/var/lang/lib/python3.8/multiprocessing/context.py\", line 113, in SimpleQueue\n return SimpleQueue(ctx=self.get_context())\n",
" File \"/var/lang/lib/python3.8/multiprocessing/queues.py\", line 336, in __init__\n self._rlock = ctx.Lock()\n",
" File \"/var/lang/lib/python3.8/multiprocessing/context.py\", line 68, in Lock\n return Lock(ctx=self.get_context())\n",
" File \"/var/lang/lib/python3.8/multiprocessing/synchronize.py\", line 162, in __init__\n SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)\n",
" File \"/var/lang/lib/python3.8/multiprocessing/synchronize.py\", line 57, in __init__\n sl = self._semlock = _multiprocessing.SemLock(\n"
]
}
Other code:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import json
import sys
sys.path.append("/mnt/access")

tokenizer = AutoTokenizer.from_pretrained("/mnt/access/Dis_Save/")
model = AutoModelForQuestionAnswering.from_pretrained("/mnt/access/Dis_Save/", return_dict=True)

def lambda_handler(event, context):
    text = r"""
    🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
    architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
    Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
    TensorFlow 2.0 and PyTorch.
    """
    questions = ["How many pretrained models are available in 🤗 Transformers?"]
    for question in questions:
        inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
        input_ids = inputs["input_ids"].tolist()[0]
        text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
        answer_start_scores, answer_end_scores = model(**inputs).values()
        answer_start = torch.argmax(
            answer_start_scores
        )  # Get the most likely beginning of answer with the argmax of the score
        answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
        print(f"Question: {question}")
        print(f"Answer: {answer}")
    return {
        'statusCode': 200,
        'body': json.dumps(answer)
    }
Edit:
I ran the code. It runs well on its own; however, I get an error when running it through the API itself:
{
"errorMessage": "'tuple' object has no attribute 'values'",
"errorType": "AttributeError",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 39, in lambda_handler\n answer_start_scores, answer_end_scores = model(**inputs).values()\n"
]
}
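The AttributeError suggests that the model call returned a plain tuple rather than a dict-like output (older transformers releases ignore return_dict). A minimal sketch, assuming the variable names from the handler above, that unpacks the logits either way:
outputs = model(**inputs)
if isinstance(outputs, tuple):
    # Older transformers releases return (start_logits, end_logits, ...)
    answer_start_scores, answer_end_scores = outputs[0], outputs[1]
else:
    # With return_dict=True the output exposes named attributes
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits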

Error removing over-represented sequences: TypeError: coercing to Unicode: need string or buffer, NoneType found

Hi, I am running this Python script to remove over-represented sequences from my fastq files, but I keep getting the error below. I am new to bioinformatics and have been following a fixed pipeline for sequence assembly. I wanted to remove the over-represented sequences with this script:
python /home/TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py -1 R1_1.fq -2 R1_2.fq
Here is the error:
Traceback (most recent call last):
File "TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py", line 46, in
leftseqs=ParseFastqcLog(opts.l_fastqc)
File "TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py", line 33, in ParseFastqcLog
with open(fastqclog) as fp:
TypeError: coercing to Unicode: need string or buffer, NoneType found
Here is the script:
import sys
import gzip
from os.path import basename
import argparse
import re
from itertools import izip, izip_longest

def seqsmatch(overreplist, read):
    flag = False
    if overreplist != []:
        for seq in overreplist:
            if seq in read:
                flag = True
                break
    return flag

def get_input_streams(r1file, r2file):
    if r1file[-2:] == 'gz':
        r1handle = gzip.open(r1file, 'rb')
        r2handle = gzip.open(r2file, 'rb')
    else:
        r1handle = open(r1file, 'r')
        r2handle = open(r2file, 'r')
    return r1handle, r2handle

def FastqIterate(iterable, fillvalue=None):
    "Grab one 4-line fastq read at a time"
    args = [iter(iterable)] * 4
    return izip_longest(fillvalue=fillvalue, *args)

def ParseFastqcLog(fastqclog):
    with open(fastqclog) as fp:
        for result in re.findall('Overrepresented sequences(.*?)END_MODULE', fp.read(), re.S):
            seqs = ([i.split('\t')[0] for i in result.split('\n')[2:-1]])
    return seqs

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="options for removing reads with over-represented sequences")
    parser.add_argument('-1', '--left_reads', dest='leftreads', type=str, help='R1 fastq file')
    parser.add_argument('-2', '--right_reads', dest='rightreads', type=str, help='R2 fastq file')
    parser.add_argument('-fql', '--fastqc_left', dest='l_fastqc', type=str, help='fastqc text file for R1')
    parser.add_argument('-fqr', '--fastqc_right', dest='r_fastqc', type=str, help='fastqc text file for R2')
    opts = parser.parse_args()

    leftseqs = ParseFastqcLog(opts.l_fastqc)
    rightseqs = ParseFastqcLog(opts.r_fastqc)

    r1_out = open('rmoverrep_' + basename(opts.leftreads).replace('.gz', ''), 'w')
    r2_out = open('rmoverrep_' + basename(opts.rightreads).replace('.gz', ''), 'w')
    r1_stream, r2_stream = get_input_streams(opts.leftreads, opts.rightreads)

    counter = 0
    failcounter = 0
    with r1_stream as f1, r2_stream as f2:
        R1 = FastqIterate(f1)
        R2 = FastqIterate(f2)
        for entry in R1:
            counter += 1
            if counter % 100000 == 0:
                print "%s reads processed" % counter
            head1, seq1, placeholder1, qual1 = [i.strip() for i in entry]
            head2, seq2, placeholder2, qual2 = [j.strip() for j in R2.next()]
            flagleft, flagright = seqsmatch(leftseqs, seq1), seqsmatch(rightseqs, seq2)
            if True not in (flagleft, flagright):
                r1_out.write('%s\n' % '\n'.join([head1, seq1, '+', qual1]))
                r2_out.write('%s\n' % '\n'.join([head2, seq2, '+', qual2]))
            else:
                failcounter += 1

    print 'total # of reads evaluated = %s' % counter
    print 'number of reads retained = %s' % (counter - failcounter)
    print 'number of PE reads filtered = %s' % failcounter
    r1_out.close()
    r2_out.close()
Maybe you already solved it; I had the same error, but it is running well for me now. Hope this helps. The TypeError happens because the FastQC log arguments (-fql/-fqr) were never passed on the command line, so opts.l_fastqc is None when open() is called.
(1) Files we need:
usage: RemoveFastqcOverrepSequenceReads.py [-h] [-1 LEFTREADS] [-2 RIGHTREADS] [-fql L_FASTQC] [-fqr R_FASTQC]
(2) Specify the fastqc_data.txt files that are in the FastQC output (unzip the output directory):
'-fql','--fastqc_left',dest='l_fastqc',type=str,help='fastqc text file for R1'
'-fqr','--fastqc_right',dest='r_fastqc',type=str,help='fastqc text file for R2'
(3) Keep the reads and the fastqc_data.txt files in the same directory.
(4) Specify the path to each file:
python RemoveFastqcOverrepSequenceReads.py
    -1 ./bicho.fq.1.gz -2 ./bicho.fq.2.gz
    -fql ./fastqc_data_bicho_1.txt -fqr ./fastqc_data_bicho_2.txt
(5) Run! :)
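If you are also free to edit the script, a small sketch (my suggestion, not part of the original tool) of marking the four arguments as required, so argparse exits with a usage message instead of crashing later with the TypeError:
import argparse

parser = argparse.ArgumentParser(description="options for removing reads with over-represented sequences")
parser.add_argument('-1', '--left_reads', dest='leftreads', type=str, required=True, help='R1 fastq file')
parser.add_argument('-2', '--right_reads', dest='rightreads', type=str, required=True, help='R2 fastq file')
parser.add_argument('-fql', '--fastqc_left', dest='l_fastqc', type=str, required=True, help='fastqc_data.txt for R1')
parser.add_argument('-fqr', '--fastqc_right', dest='r_fastqc', type=str, required=True, help='fastqc_data.txt for R2')
opts = parser.parse_args()  # exits with a usage message if any required flag is missing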

Aiohttp web.Response: adding headers

Here is how a handler method ends; it dispatches the matches in the body and the number of matches in the headers.
.
.
match_count = len(matches)
tot = {'total': match_count}
return web.json_response({"matches": fdata}, headers=tot)
While processing, I am getting the server error below:
File "/workspace/aio/server/aioenv/lib/python3.6/site-packages/aiohttp/http_writer.py", line 111, in write_headers
buf = _serialize_headers(status_line, headers)
File "aiohttp/_http_writer.pyx", line 138, in aiohttp._http_writer._serialize_headers
File "aiohttp/_http_writer.pyx", line 110, in aiohttp._http_writer.to_str
TypeError: Cannot serialize non-str key 19
Could somebody please explain? tot must be a dict, as the docs explain. How can I convert this into a str?
Headers cannot contain int values. You have to convert the value to a string as follows:
tot = {'total': str(match_count)}
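For context, a minimal self-contained handler along those lines; the route, payload, and result list here are placeholders rather than the original code:
from aiohttp import web

async def search(request):
    matches = ["a", "b", "c"]            # placeholder result list
    fdata = matches                      # placeholder body payload
    tot = {'total': str(len(matches))}   # header values must be str, not int
    return web.json_response({"matches": fdata}, headers=tot)

app = web.Application()
app.add_routes([web.get('/search', search)])
# web.run_app(app) would serve it on http://localhost:8080/search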

Incorrect representation of the string in csv file

I'm on Win7, Python2.7.
I have the string.
Original view:
A. P. Møller Mærsk
UTF-8:
s = 'A. P. M\xc3\xb8ller M\xc3\xa6rsk'
I need to write it to a CSV file.
I try this:
with open('14.09 Anbefalte aksjer.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([s])
Got this:
A. P. Møller Mærsk
I try to use UnicodeWriter:
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

s = 'A. P. M\xc3\xb8ller M\xc3\xa6rsk'.decode('utf8')
with open('14.09 Anbefalte aksjer.csv', 'w') as csvfile:
    writer = UnicodeWriter(csvfile)
    writer.writerow([s])
And got again:
A. P. Møller Mærsk
Try unicodecsv:
Again:
A. P. Møller Mærsk
What's wrong? How can I write it right?
What you see is mojibake: bytes that represent Unicode text encoded in one character encoding are shown in another (incompatible) character encoding.
If ''.decode('utf8') doesn't raise AttributeError, then it means that you are not on Python 3 (despite what your question says). On Python 2, csv doesn't support Unicode directly; you have to encode manually:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import csv

text = "A. P. Møller Mærsk"
with open('14.09 Anbefalte aksjer.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([text.encode('utf-8')])
Both UnicodeWriter and the unicodecsv module should work as well, provided text contains uncorrupted data.
Windows tools such as Notepad or Excel assume the default Windows locale's encoding, so for UTF-8 a byte order mark (BOM, U+FEFF) must be written at the start of the file. Python provides an encoding for this: utf-8-sig. Note also that by using #coding:utf8 and saving your source file in UTF-8, you can declare your string directly as a Unicode string. Finally, files for use with the csv module should be opened as 'wb' on Python 2.7, or you will see problems writing newlines on Windows.
#coding:utf8
import csv
from StringIO import StringIO
import codecs

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    # Use utf-8-sig encoding here.
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        # Redirect output to a queue
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

s = u'A. P. Møller Mærsk'  # declare as a Unicode string
with open('14.09 Anbefalte aksjer.csv', 'wb') as csvfile:
    writer = UnicodeWriter(csvfile)
    writer.writerow([s])
Output:
A. P. Møller Mærsk
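For reference, on Python 3 no helper class is needed, because the csv module writes text directly; a minimal sketch assuming Python 3 and the same target file:
import csv

s = 'A. P. Møller Mærsk'
# utf-8-sig writes the BOM that Excel on Windows expects; newline='' avoids
# blank rows when the csv module writes its own line endings.
with open('14.09 Anbefalte aksjer.csv', 'w', encoding='utf-8-sig', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([s])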

How to convert UTF8 byte arrays to string in lua

I have a table like this:
table = {57,55,0,15,-25,139,130,-23,173,148,-24,136,158}
It is a UTF-8 encoded byte array produced by PHP's unpack function:
unpack('C*',$str);
How can I convert it to a UTF-8 string I can read in Lua?
Lua doesn't provide a direct function for turning a table of utf-8 bytes in numeric form into a utf-8 string literal. But it's easy enough to write something for this with the help of string.char:
function utf8_from(t)
  local bytearr = {}
  for _, v in ipairs(t) do
    local utf8byte = v < 0 and (0xff + v + 1) or v
    table.insert(bytearr, string.char(utf8byte))
  end
  return table.concat(bytearr)
end
Note that none of Lua's standard functions or provided string facilities are UTF-8 aware. If you try to print the UTF-8 encoded string returned from the above function, you'll just see some funky symbols. If you need more extensive UTF-8 support, you'll want to check out some of the libraries mentioned on the Lua wiki.
Here's a comprehensive solution that works for the UTF-8 character set restricted by RFC 3629:
do
  local bytemarkers = { {0x7FF, 192}, {0xFFFF, 224}, {0x1FFFFF, 240} }
  function utf8(decimal)
    if decimal < 128 then return string.char(decimal) end
    local charbytes = {}
    for bytes, vals in ipairs(bytemarkers) do
      if decimal <= vals[1] then
        for b = bytes + 1, 2, -1 do
          local mod = decimal % 64
          decimal = (decimal - mod) / 64
          charbytes[b] = string.char(128 + mod)
        end
        charbytes[1] = string.char(vals[2] + decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end
function utf8frompoints(...)
  local chars, arg = {}, {...}
  for i, n in ipairs(arg) do chars[i] = utf8(arg[i]) end
  return table.concat(chars)
end
print(utf8frompoints(72, 233, 108, 108, 246, 32, 8364, 8212))
--> Héllö €—
