UTF-8 issue with CoreNLP server

I run a Stanford CoreNLP Server with the following command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
I try to parse the sentence Who was Darth Vader’s son?. Note that the apostrophe after Vader is not an ASCII character.
The online demo parses the sentence successfully, but the server I run on localhost fails.
I also tried to perform the query using Python.
import requests
url = 'http://localhost:9000/'
sentence = 'Who was Darth Vader’s son?'
r=requests.post(url, params={'properties' : '{"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}'}, data=sentence.encode('utf8'))
tree = r.json()
The last command raises an exception:
ValueError: Invalid control character at: line 1 column 1172 (char 1171)
However, I noticed occurrences of the character \x00 in the text (i.e. r.text). If I remove them, the json parsing succeeds:
import json
tree = json.loads(r.text.replace('\x00', ''))
Finally, r.encoding is ISO-8859-1, even though I did not use the -strict option when running the server. Note that nothing changes if I manually set it to UTF-8.
If I run the same code replacing url = 'http://localhost:9000/' with url = 'http://corenlp.run/', then everything succeeds: the call r.json() returns a dict, r.encoding is indeed UTF-8, and no \x00 character appears in the text.
What is wrong with the CoreNLP server I run?

This is a known bug with the 3.6.0 release. If you build the server from GitHub, it should work properly with UTF-8 characters. Setting the appropriate Content-Type header in the request will also fix this issue (see https://github.com/stanfordnlp/CoreNLP/issues/125).
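For reference, the header workaround might look like this with the question's requests call (a sketch of mine, not from the issue thread; the exact Content-Type value is an assumption):
import requests

url = 'http://localhost:9000/'
sentence = 'Who was Darth Vader’s son?'
props = '{"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}'
# Declare the body's charset explicitly so the 3.6.0 server does not fall
# back to a single-byte decoding of the UTF-8 bytes.
r = requests.post(url, params={'properties': props},
                  data=sentence.encode('utf8'),
                  headers={'Content-Type': 'text/plain; charset=utf-8'})
tree = r.json()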

Related

Ruby invalid multibyte char error (Sep 2019)

My script fails on this bad encoding. I brought all the files to UTF-8, but some still won't convert or simply have wrong characters inside.
It actually fails at the variable assignment step.
Can I set some kind of error handling for this case, like below, so my loop will continue? That ¿ causes all the problems.
I need to run this script all the way through without errors. I have already tried encoding, force_encoding, and the shebang line. Does Ruby have any kind of error handling routine so I can handle this bad case and continue with the rest of the script? How do I get rid of the error invalid multibyte char (UTF-8)?
line = '¿USE [Alpha]'
lineOK = ' USE [Alpha] OK line'
>ruby ReadFile_Test.rb
ReadFile_Test.rb:15: invalid multibyte char (UTF-8)
I could reproduce your issue by saving the file with ISO-8859-1 encoding.
Running your code with the file in this non-UTF-8 encoding, the error popped up. My solution was to save the file as UTF-8.
I am using Sublime as my text editor, and it has the option 'File > Save with Encoding'. I chose 'UTF-8' and was able to run the script.
Using puts line.encoding then showed me UTF-8, and there was no error anymore.
I suggest re-checking the encoding of your saved script file.
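If you would rather locate the offending byte than eyeball the editor, a small check can report where the file stops being valid UTF-8 (my own sketch, written in Python like most examples on this page; the filename is the one from the question):
def find_invalid_utf8(path):
    # Report the first byte offset at which a file is not valid UTF-8.
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
        print('%s decodes cleanly as UTF-8' % path)
    except UnicodeDecodeError as e:
        # e.start is the offset of the first undecodable byte; a stray
        # Latin-1 '¿' shows up here as 0xBF.
        print('%s: invalid byte 0x%02X at offset %d'
              % (path, ord(data[e.start:e.start + 1]), e.start))

find_invalid_utf8('ReadFile_Test.rb')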

Encoding issue on subprocess.Popen args

Yet another encoding question on Python.
How can I pass non-ASCII characters as parameters in a subprocess.Popen call?
My problem is not with stdin/stdout, as in the majority of other questions on Stack Overflow, but with passing those characters in the args parameter of Popen.
Python script used for testing:
import subprocess
cmd = 'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'
process = subprocess.Popen(cmd,stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
output, err = process.communicate()
result = process.wait()
print result, '-', output
For this example call, script.py receives the parameter with its non-ASCII characters garbled (mojibake) instead of Testç on ã and ê. If I copy-paste this same command string into a CMD shell, it works fine.
What I've tried, besides what's described above:
1. Checked that all Python scripts are encoded in UTF-8. They are.
2. Changed to unicode (cmd = u'...'); received a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 128: ordinal not in range(128) on line 5 (the Popen call).
3. Changed to cmd = u'...'.decode('utf-8'); received a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 128: ordinal not in range(128) on line 3 (the decode call).
4. Changed to cmd = u'...'.encode('utf8'); the script again receives the text garbled.
5. Added the PYTHONIOENCODING=utf-8 environment variable, with no luck.
Looking at tries 2 and 3, it seems like Popen issues a decode call internally, but I don't have enough experience in Python to go further based on this suspicion.
Environment: Python 2.7.11 running on Windows Server 2012 R2.
I've searched for similar problems but haven't found any solution. A similar question is asked in what is the encoding of the subprocess module output in Python 2.7?, but no viable solution is offered.
I read that Python 3 changed the way string and encoding works, but upgrading to Python 3 is not an option currently.
Thanks in advance.
As noted in the comments, subprocess.Popen in Python 2 calls the Windows function CreateProcessA, which accepts a byte string in the currently configured code page. Luckily, Python has an encoding named mbcs which stands in for the current code page.
cmd = u'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'.encode('mbcs')
Unfortunately you can still fail if the string contains characters that can't be encoded into the current code page.
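Putting that together with the question's script, the call might look like this (a sketch; mbcs exists only on Windows, and as noted above, characters outside the current code page can still be lost):
# -*- coding: utf-8 -*-
import subprocess

# 'mbcs' encodes to the current ANSI code page, which is what CreateProcessA
# expects on the receiving end.
cmd = u'C:\\Python27\\python.exe C:\\path_to\\script.py -n "Testç on ã and ê"'
process = subprocess.Popen(cmd.encode('mbcs'),
                           stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT)
output, err = process.communicate()
print process.returncode, '-', output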

Can I stop Stanford POS and NER taggers from removing "#" and "@" characters?

I'm doing some processing with the Stanford NLP software. First of all, thanks everyone at Stanford for all this great stuff!!! Here's my conundrum:
I have sentences that can contain URLs ("http://"), email addresses ("@"), and hashtags ("#"). I'm using Python to do the work. If I use the POS and NER tagging built into nltk, all these special characters are kept in their tokenized words. But it's very slow to use these, since each call fires up a new Java instance. So I've run the taggers in server mode instead. And when I pass the full sentences through, they come back with all those special characters stripped off. I'm using the Python sner package to interface with the servers.
Here's what I mean. To use the nltk StanfordPOSTagger, you have to pass in a pre-tokenized sentence. I'm using the StanfordTokenizer.
>>>from subprocess import Popen
>>>from nltk.tag.stanford import StanfordPOSTagger
>>>from nltk.tokenize import StanfordTokenizer
>>>import sner # https://pypi.python.org/pypi/sner
>>>sent="Here's an #example from me@y.ou url http://me.you"
>>>st=StanfordTokenizer(homedir+'models/stanford-postagger.jar',
options={"ptb3Ellipsis":False})
>>>nltk_pos=StanfordPOSTagger(homedir+'models/english-bidirectional-distsim.tagger',
homedir+'models/stanford-postagger.jar')
>>>pos_args=['java', '-mx300m', '-cp', homedir+'/models/stanford-postagger.jar',
'edu.stanford.nlp.tagger.maxent.MaxentTaggerServer','-model',
homedir+'models/english-bidirectional-distsim.tagger','-port','2020']
>>>POS=Popen(pos_args)
>>>sp=sner.Ner(host="localhost",port=2020)
>>>nltk_pos.tag(st.tokenize(sent))
[(u'Here', u'RB'), (u"'s", u'VBZ'), (u'an', u'DT'),
 (u'#example', u'NN'), (u'from', u'IN'), (u'me@y.ou', u'NN'),
 (u'url', u'NN'), (u'http://me.you', u'NN')]
>>>sp.tag(sent)
[(u'Here', u'RB'), (u"'s", u'VBZ'), (u'an', u'DT'),
 (u'example', u'NN'), (u'from', u'IN'), (u'y.ou', u'NN'),
 (u'url', u'NN'), (u'//me.you', u'NN')]
I'm curious why there's a difference, and whether there is a way to get the servers not to strip out those characters. I've read that there are ways to pass flags to the POS server to use pre-tokenized text ("-tokenize false"), but I can't figure out how to pass that list of strings to the server with the Python interface. In the sner package, the text to be parsed is sent as a single string, not a list of strings as returned by a tokenizer.
The problem is that nltk uses edu.stanford.nlp.process.WhitespaceTokenizer as the tokenizerFactory.
You can just change the ner-server parameters like this:
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions tokenizeNLs=false
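With the server restarted that way, the client side could pre-tokenize and send one space-joined string, so the whitespace tokenizer splits exactly on the Stanford tokenizer's boundaries (a sketch of mine reusing st and sent from the question; the port matches the command above, not your setup):
from sner import Ner

# Pre-tokenize on the client, then join with spaces; the server's
# WhitespaceTokenizer leaves '#', '@', and 'http://' intact.
tagger = Ner(host='localhost', port=9199)
print tagger.tag(' '.join(st.tokenize(sent)))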

json parser error unexpected token

I am getting a JSON response array as below.
"[{\"id\":\"23886\",\"item_type\":2,\"name\":\"Equalizer\",\"label\":null,\"desc\":null,\"genre\":null,\"show_name\":null,\"img\":\"http:\\/\\/httpg3.scdn.arkena.com\\/10242\\/v2_images\\/tf1\\/0\\/tf1_media_ingest95290_image\\/tf1_media_ingest95290_image_0_208x277.jpg\",\"url\":\"\\/films\\/media-23886-Equalizer.html\",\"duration\":\"2h27mn\",\"durationtime\":\"8865\",\"audio_languages\":null,\"prod\":null,\"year\":null,\"vf\":\"1\",\"vost\":\"1\",\"sd\":true,\"hd\":false,\"sdprice\":\"4.99\",\"hdprice\":null,\"sdfile\":null,\"hdfile\":null,\"sdbundle\":\"12771\",\"hdbundle\":\"12771\",\"teaser\":\"23887\",\"att_getter\":\"Tout le monde a le droit \\u00e0 la justice\",\"orig_prod\":null,\"director\":null,\"actors\":null,\"csa\":\"CSA_6\",\"season\":null,\"episode\":null,\"typeid\":\"1\",\"isfav\":false,\"viewersrating\":\"4.0\",\"criticsrating\":\"3.0\",\"onThisPf\":1},{\"id\":\"23998\",\"item_type\":2,\"name\":\"Le Labyrinthe\",\"label\":null,\"desc\":null,\"genre\":null,\"show_name\":null,\"img\":\"http:\\/\\/httpg3.scdn.arkena.com\\/10242\\/v2_images\\/tf1\\/1\\/tf1_media_ingest94727_image\\/tf1_media_ingest94727_image_1_208x277.jpg\",\"url\":\"\\/films\\/media-23998-Le_Labyrinthe.html\",\"duration\":\"1h48mn\",\"durationtime\":\"6533\",\"audio_languages\":null,\"prod\":null,\"year\":null,\"vf\":\"1\",\"vost\":\"1\",\"sd\":true,\"hd\":false,\"sdprice\":\"4.99\",\"hdprice\":null,\"sdfile\":null,\"hdfile\":null,\"sdbundle\":\"12699\",\"hdbundle\":\"12699\",\"teaser\":\"23999\",\"att_getter\":\"Saurez-vous r\\u00e9chapper du labyrinthe ?\",\"orig_prod\":null,\"director\":null,\"actors\":null,\"csa\":\"CSA_1\",\"season\":null,\"episode\":null,\"typeid\":\"1\",\"isfav\":false,\"viewersrating\":\"3.5\",\"criticsrating\":\"4.0\",\"onThisPf\":1},{\"id\":\"23688\",\"item_type\":2,\"name\":\"Gone Girl\",\"label\":null,\"desc\":null,\"genre\":null,\"show_name\":null,\"img\":\"http:\\/\\/httpg3.scdn.arkena.com\\/10242\\/v2_images\\/tf1\\/0\\/tf1_media_ingest92895_image\\/tf1_media_ingest92895_image_0_208x277.jpg\",\"url\":\"\\/films\\/media-23688-Gone_Girl.html\",\"duration\":\"2h22mn\",\"durationtime\":\"8579\",\"audio_languages\":null,\"prod\":null,\"year\":null,\"vf\":\"1\",\"vost\":\"1\",\"sd\":true,\"hd\":false,\"sdprice\":\"4.99\",\"hdprice\":null,\"sdfile\":null,\"hdfile\":null,\"sdbundle\":\"12507\",\"hdbundle\":\"12507\",\"teaser\":\"23689\",\"att_getter\":\"Il ne faut pas se fier aux apparences...\",\"orig_prod\":null,\"director\":null,\"actors\":null,\"csa\":\"CSA_2\",\"season\":null,\"episode\":null,\"typeid\":\"1\",\"isfav\":false,\"viewersrating\":\"4.0\",\"criticsrating\":\"4.5\",\"onThisPf\":1}]"
When I try to parse it, I get an Unexpected token parser error, which I believed was due to the quotes at the beginning and end of the response.
I was wrong to say that the parser error was due to the quotes at the beginning and end of the response; I am not sure why it happens, but when I try to parse the JSON response array, it does throw an error.
Any idea whether there is anything wrong with the JSON response array?
I tried to parse it as below, but it throws a parser error:
JSON.parse(File.read('demo')). The demo file contains the JSON
response which I pasted above.
First of all, the JSON you posted is a Ruby String, and Ruby parses it as JSON without error. However, if you paste that string into a file, it will not be valid JSON because of the escape sequences, the most numerous of which is \".
In a Ruby string, the sequence \", which is two characters long, is converted to one character; in a file, that same sequence is two characters long: a \ and a ". In other words, escape sequences that are legal inside a Ruby String do not represent the same thing when pasted into a file.
Another example: in a Ruby string the escape sequence \u20AC is a single character, the Euro sign. However, if you paste that sequence into a file, it will be six characters long: a \, a u, a 2, a 0, an A, and a C.
Response to comment:
There is an invisible byte order mark (BOM) at the start of the JSON, which you can see by executing:
p resp
...which produces the output:
\xEF\xBB\xBF[{\"id\":\"2388\" .....
The UTF-8 representation of the BOM is the byte sequence
0xEF,0xBB,0xBF
Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.
You can skip the first 3 bytes/characters like this:
resp[3..-1]
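For comparison, the same guard in Python (my sketch, not from the original answers) is to decode with utf-8-sig, which behaves like UTF-8 but silently drops a leading BOM:
import json

with open('demo', 'rb') as f:
    raw = f.read()
# 'utf-8-sig' strips the 0xEF,0xBB,0xBF byte order mark if it is present
# at the start, then decodes the rest as ordinary UTF-8.
tree = json.loads(raw.decode('utf-8-sig'))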
I had this error when reading in JSON files, and it turned out the issue was that JSON.parse somehow did not like UTF-8-encoded files. Once I encoded the files to ASCII (= ISO 8859-1), everything went fine.
Try this. It works.
require 'json'
my_obj = JSON.parse("your json string", :symbolize_names => true)

Replacing "\xe9" character from a unicode string in Python 3

Using SublimeText 2.0.2 with Python 3.4.2, I get a webpage with urllib:
response = urllib.request.urlopen(req)
pagehtml = response.read()
Print => qualit\xe9">\r\n\t\t<META HTTP
I get a "\xe9" character within the unicode string!
The header of the pagehtml tells me it's encoded in ISO-8859-1
(Content-Type: text/html;charset=ISO-8859-1). But if I decode it with ISO-8859-1 and then encode it in utf-8, it only gets worse...
resultat = pagehtml.decode('ISO-8859-1').encode('utf-8')
Print => qualit\xc3\xa9">\r\n\t\t<META HTTP
How can I replace all the "\xe9"... characters with their corresponding letters ("é"...)?
Edit 1
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')!
I should mention I'm running my code within SublimeText 2.0.2. That seems to be my problem.
Edit 2
It works fine in IDLE (Python 3.4.2) and in the OS X terminal (Python 2.5), but it doesn't work in SublimeText 2.0.2 (with Python 3.4.2), so it seems to be a problem with the SublimeText console (output window) and not with my code.
I'm going to look at the PYTHONIOENCODING environment variable as suggested by J.F. Sebastian.
It seems I should be able to set it in the sublime-build file.
Edit 3 - Solution
I just added "env": {"PYTHONIOENCODING": "UTF-8"} in the sublime-build file.
Done. Thanks everyone ;-)
The response is an encoded byte string. Just decode it:
>>> pagehtml = b'qualit\xe9'
>>> print(pagehtml)
b'qualit\xe9'
>>> print(pagehtml.decode('ISO-8859-1'))
qualité
I am pretty sure you do not actually have a problem, except for understanding bytes versus unicode. Things are working as they should. pagehtml is encoded bytes. (I confirmed this with req = 'http://python.org' in your first line.) When bytes are displayed, those which can be interpreted as printable ascii encodings are printed as such and other bytes are printed with hex escapes. b'\xe9' is the hex escape encoding of the single-byte ISO-8859-1 encoding of é and b'\xc3\xa9' is the hex escape encoding of its double-byte utf-8 encoding.
>>> b = b"qualit\xe9"
>>> u = b.decode('ISO-8859-1')
>>> u
'qualité'
>>> b2 = u.encode()
>>> b2
b'qualit\xc3\xa9'
>>> len(b) == 7 and len(b2) == 8
True
>>> b[6]
233
>>> b2[6], b2[7]
(195, 169)
So pageuni = pagehtml.decode('ISO-8859-1') gives you the page as unicode. This decoding does the replacing that you asked for.
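As an aside, you can avoid hard-coding the charset by reading it from the Content-Type header (a sketch; req is the request object from the question):
import urllib.request

response = urllib.request.urlopen(req)  # req as built in the question
pagehtml = response.read()
# Use the charset announced by the server, falling back to ISO-8859-1
# (HTTP's historical default) when the header omits it.
charset = response.headers.get_content_charset() or 'ISO-8859-1'
pageuni = pagehtml.decode(charset)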
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')! I should mention I'm running my code within SublimeText. That seems to be my problem. Any solution?
Don't encode manually; print unicode strings instead.
For Unix
Set PYTHONIOENCODING=utf-8 if the output is redirected or if locale (LANGUAGE, LC_ALL, LC_CTYPE, LANG) is not configured (it defaults to C (ascii)).
For Windows
If the content can be represented using the console codepage then set PYTHONIOENCODING=your_console_cp envvar e.g., PYTHONIOENCODING=cp1252 (set it to cp1252 only if it is indeed the encoding that your console uses, run chcp to check). Or use whatever encoding SublimeText can show correctly if it doesn't open a console window to run Python scripts.
You don't need to set the PYTHONIOENCODING envvar if you run your script from the command line directly and the output is not redirected.
Otherwise (to support characters that can't be represented in the console encoding), install win_unicode_console package and either run your script using python3 -mrun your_script.py or put at the top of your script:
import win_unicode_console
win_unicode_console.enable()
It uses Win32 API such as WriteConsoleW() to print to the console. You still need to configure correct fonts to see arbitrary Unicode text in the console.
